Chairs' Message


The  IEEE  Workshop  on  Real-Time  Operating  Systems  and  Software  is  a  forum  that 
covers  recent  advances  in  real-time  computing  —  a  field  that  is  becoming  an  essential 
part  of  computer  science  and  engineering.  It  brings  together  practitioners  and 
researchers  from  academia,  industry,  and  government,  to  explore  the  best  current  ideas 
on  real-time  software  and  operating  systems,  and  to  evaluate  the  maturity  and 
directions of real-time system technology. As the demand for the functionality and
reliability of real-time systems continues to grow, our intellectual and engineering
abilities are being challenged to come up with practical solutions to the problems faced
in the design and development of complex real-time systems.

The  interest  in  this  important  topic  is  confirmed  by  the  high  number  of  quality 
submissions.  Following  the  tradition  of  previous  RTOSS  workshops,  parallel  sessions 
are  avoided  in  order  to  give  participants  the  opportunity  to  be  involved  in  interactions 
with  speakers  and  panelists,  and  to  exchange  opinions  with  all  other  participants.  As  a 
consequence,  many  good  position  papers  had  to  be  rejected. 

The  technical  program  covers  a  wide  range  of  issues,  such  as  scheduling,  operating 
systems,  communications,  timing  analysis,  system  design,  concurrency  control,  and 
formal  methods.  Besides  the  various  sessions,  the  program  includes  three  panel  sessions 
to address important issues on real-time programming languages, education, and
real-time scheduling. In addition, Nancy Leveson from the University of Washington will
deliver  an  invited  talk  on  software  safety. 

Many  people  worked  hard  to  make  this  year's  RTOSS  workshop  a  success.  The 
Program  Committee  members  carefully  reviewed  and  discussed  every  submitted  paper, 
and  made  the  difficult  decisions  on  which  papers  to  accept.  We  also  would  like  to  thank 
the  authors  of  all  the  submitted  papers.  Special  thanks  go  to  Alicen  Smith  for  managing 
the  administrative  activities,  and  Bob  Werner  of  the  IEEE  Computer  Society  for  the 
publication of these proceedings. Finally, we are grateful to the IEEE Computer Society
Technical Committee on Real-Time Systems, the Office of Naval Research, and the
Departments of Computer Science at the University of Virginia and the University of
Washington. 

Welcome  to  Seattle! 
Abstract

Both predictable interprocessor synchronization and
fast interrupt response are required for real-time systems
constructed using asymmetric shared-memory multiprocessors.
This paper points out the problem that conventional
spin lock algorithms cannot satisfy both requirements at
the same time. To solve this problem, we have proposed
an algorithm which is an extension of queueing spin locks
modified to be preemptable for servicing interrupts [1]. In
this paper, we propose an improved algorithm that minimizes
the overhead of recovering from an interrupt service.
We also demonstrate that the proposed algorithms have
the required properties through performance measurement.

1  Introduction 

In  many  applications  of  high  performance  real-time 
systems,  a  large  number  of  external  devices  such  as  sensors, 
actuators,  and  network  controllers  are  connected  to  a  system 
and  the  system  is  required  to  respond  to  the  external 
events  from  the  devices  within  predefined  and  usually 
short  time-bounds.  To  meet  this  requirement,  asymmetric 
multiprocessors  in  which  each  device  is  handled  by  a  fixed 
processor  are  often  adopted. 

In order to realize real-time systems using shared-memory
multiprocessors, predictable interprocessor synchronization
mechanisms are of primary importance. In
addition to adopting a real-time scheduling algorithm with
resource constraints or a real-time synchronization protocol,
the execution time of the underlying mutual exclusion
mechanism using spin locks must be bounded.¹

In asymmetric shared-memory multiprocessors, each
processor is required to achieve fast and predictable response
to interrupt requests, because external events are
notified to each processor in the form of interrupts. However,
with conventional bounded spin lock algorithms, a processor
cannot respond to external interrupts with short latency.

¹ We assume that the access time of the shared bus (or interconnection
network) is bounded in this paper.

To solve this problem, we have proposed an algorithm
which is an extension of queueing spin locks modified to
be preemptable for servicing interrupts [1]. With the algorithm,
an upper bound on the time to acquire and release an
interprocessor lock can be given when no interrupt request
occurs, and fast response to interrupt requests is achieved.
However, the algorithm has a shortcoming: a processor
may have to re-execute the lock acquiring routine from
the beginning after it services an interrupt request. In
schedulability analysis, this re-execution overhead must be
added to the interrupt service time.

In this paper, we propose an improved algorithm that
minimizes this overhead. We also demonstrate that the
proposed algorithms have the required properties through
performance measurement.

2  Spin  locks  and  interrupt  latency 

In this paper, we assume that atomic read-modify-write
operations on a single word of shared memory (e.g.
test_and_set, fetch_and_store (swap), fetch_and_add, and
compare_and_swap) are supported in hardware.

In  order  to  bound  the  time  until  a  processor  acquires 
an  interprocessor  lock,  the  duration  that  each  processor 
holds  the  lock  must  be  bounded  as  well  as  the  number  of 
contending  processors  that  the  processor  must  wait  for.  The 
latter  condition  can  be  met  with  ticket  locks  or  queueing 
locks  [2],  with  which  the  turn  that  a  processor  acquires  a 
lock  is  determined  when  it  begins  waiting  for  the  lock.  To 
satisfy  the  former  condition,  the  relationship  with  interrupt 
services  must  be  considered. 
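The property that a waiter's turn is fixed when it starts waiting can be illustrated with a toy ticket lock. This is only a sketch in Python, not code from the paper: the class and method names are hypothetical, and a `threading.Lock` stands in for the atomic fetch_and_add that the hardware is assumed to provide.

```python
import threading


class TicketLock:
    """Toy ticket lock: a waiter's turn is fixed when it takes a ticket."""

    def __init__(self):
        # Assumption: this guard emulates a hardware fetch_and_add.
        self._guard = threading.Lock()
        self._next_ticket = 0
        self.now_serving = 0

    def take_ticket(self):
        # Emulated fetch_and_add(&next_ticket, 1).
        with self._guard:
            ticket = self._next_ticket
            self._next_ticket += 1
        return ticket

    def waiters_ahead(self, ticket):
        # Number of processors that will hold the lock before this ticket.
        return ticket - self.now_serving

    def acquire(self):
        ticket = self.take_ticket()
        while self.now_serving != ticket:
            pass  # spin until our turn comes
        return ticket

    def release(self):
        self.now_serving += 1
```

Because `waiters_ahead` is known as soon as the ticket is taken, the waiting time is bounded once each holder's critical section is bounded, which is the "latter condition" discussed above.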

In asymmetric multiprocessor systems, interrupt services
for external devices are requested for each processor.
When multiple devices are connected to a processor, interrupt
requests from them are usually raised independently
and the maximum time to service all of the requests becomes
unbounded or very long. Consequently, in order
to give a practical bound on the duration that a processor
holds a lock, interrupt services should be inhibited for that
duration.

On the other hand, in order to realize a system with
fast response to external events, each processor must be
able to service external interrupts with short latency.
In particular, when the scalability of the system is an important
issue, the worst-case interrupt latency should be given
independently of the number of processors in the system.

0-8186-5710-3/94 $3.00 © 1994 IEEE

Here  a  problem  arises  in  deciding  whether  interrupts 
should  be  disabled  first  or  an  interprocessor  lock  should 
be  acquired  first.  When  acquiring  an  interprocessor  lock 
precedes  disabling  interrupts,  interrupts  may  be  serviced 
while  the  processor  holds  the  lock,  and  the  condition  that 
interrupt  services  should  be  inhibited  while  a  processor 
holds  a  lock  is  not  satisfied.  If  acquiring  a  lock  follows 
disabling  interrupts,  on  the  other  hand,  the  interrupt  mask 
time  includes  the  time  to  acquire  the  lock  and  its  upper 
bound  heavily  depends  on  the  number  of  processors. 

One  method  to  solve  this  problem  is  the  following.  The 
processor  first  disables  interrupts  and  tries  to  acquire  the 
lock.  If  it  fails  to  acquire  the  lock,  the  processor  probes 
interrupt  requests  before  it  retries  to  acquire  the  lock.  When 
interrupt  requests  are  detected,  it  suspends  trying  to  acquire 
the  lock,  enables  interrupts,  and  services  them. 

Test-and-set locks can be extended easily with this
method. Ticket locks and queueing locks, on the other
hand, cannot be extended similarly.
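The method described above, applied to a test-and-set lock, might look like the following sequential sketch. It is an illustration under stated assumptions rather than the authors' code: `TasLock`, `Cpu`, and the `on_spin` hook are hypothetical names, interrupts are modeled as a plain list of pending requests, and a `threading.Lock` stands in for the atomicity of the hardware test_and_set.

```python
import threading


class TasLock:
    """Test-and-set lock word (atomicity emulated with a guard)."""

    def __init__(self):
        self._guard = threading.Lock()
        self.held = False

    def test_and_set(self):
        # Returns the previous value and leaves the word set.
        with self._guard:
            old = self.held
            self.held = True
            return old

    def clear(self):
        self.held = False


class Cpu:
    """Hypothetical per-processor interrupt state."""

    def __init__(self):
        self.interrupts_enabled = True
        self.pending = []   # requests raised but not yet serviced
        self.serviced = []

    def service_pending(self):
        self.serviced.extend(self.pending)
        self.pending.clear()


def acquire_with_preemption(lock, cpu, on_spin=lambda: None):
    cpu.interrupts_enabled = False
    while lock.test_and_set():          # the lock was already held
        if cpu.pending:                 # probe interrupt requests
            cpu.interrupts_enabled = True
            cpu.service_pending()       # service them, then retry
            cpu.interrupts_enabled = False
        on_spin()                       # test hook, not part of the method
    # Here the lock is held and interrupts are disabled.


def release(lock, cpu):
    lock.clear()
    cpu.interrupts_enabled = True
```

Nothing needs to be undone before servicing an interrupt because a failed test_and_set leaves no reservation behind; ticket and queueing locks reserve a turn in shared state, which is why they cannot be extended in the same way.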

3  Queueing  locks  with  preemption 

In all spin lock algorithms that can give an upper
bound on the time until a processor acquires a lock, a
processor modifies some shared variable and reserves its
turn to acquire the lock when it begins waiting for the
lock. When its turn comes, the lock is passed to the
processor by another. If the processor simply branches to
an interrupt handler while waiting for the lock, it cannot
begin to execute the critical section immediately after the
lock is passed to it, and it makes the contending
processors wait wastefully until the interrupt service is
finished.

Consequently, when a processor begins to service interrupts
while waiting for a lock, it must inform others
that it is servicing interrupts and should not be passed the
lock. The processor trying to release the lock checks if
the succeeding processor is servicing interrupts. If the
succeeding one is found to be servicing interrupts, its turn
to acquire the lock is canceled or deferred, and the lock is
passed to the next in line.

Original  algorithm 

We have applied the above scheme to the MCS lock,
a list-based queueing lock algorithm [2], and proposed a
queueing lock algorithm with preemption [1]. Some other
spin lock algorithms can be extended similarly. Recently,
R. W. Wisniewski et al. have proposed a similar algorithm
for improving the average performance of multiprogrammed
(non-real-time) systems [3]. Craig's algorithm
can also support the same preemption scheme [4].

In the algorithm, if the processor trying to release
the lock (P0) finds that the succeeding processor (P1)
is servicing interrupts, P0 dequeues P1 from the waiting
queue and passes the lock to the successor of P1. When
only P1 is waiting for the lock, P0 makes the waiting
queue empty. P0 informs P1 that P1 is dequeued using a
shared variable. When P1 finishes the interrupt service, it
checks whether it has been dequeued during the interrupt
service or not. If it has been dequeued, it re-executes the
lock acquiring routine from the beginning. Otherwise, it
resumes waiting for the lock.

When a processor is dequeued and re-executes the lock-acquiring
routine, the waiting time from when the processor first
links itself to the queue until it branches to the interrupt
handler is wasted. When the schedulability of the system
is analyzed, this re-execution overhead should be added to
the interrupt service time. Below, we present an improved
algorithm which is devised to reduce this overhead.

Improved  algorithm 

The re-execution overhead can be reduced with the
following method. When the processor releasing the lock
(P0) finds that the succeeding processor (P1) is servicing
interrupts, P0 leaves P1 in the waiting queue instead of
dequeueing it. P0 removes the processor to which to pass
the lock from the queue using the method adopted in the
prioritized queueing spin lock presented in [5]. When P1
finishes interrupt services, it simply resumes waiting for the
lock in its original position. Therefore, the overhead which
must be added to the interrupt service time in schedulability
analysis is minimized.

A  difficulty  occurs  when  all  processors  in  the  waiting 
queue  are  servicing  interrupts.  To  handle  this  situation, 
a  global  lock  flag  is  introduced.  If  the  processor  trying 
to  release  the  lock  finds  that  all  processors  in  the  queue 
are  servicing  interrupts,  it  sets  the  global  lock  flag.  A 
processor  returning  from  interrupt  services  tries  to  get  the 
global  lock  with  the  same  method  as  with  test-and-set 
locks. If it succeeds in getting the lock, it removes itself from
the  waiting  queue.  As  the  processor  needs  to  know  the 
top  processor  in  the  queue  to  remove  itself,  the  processor 
releasing  the  global  lock  must  pass  the  information  in  some 
shared  variable.  It  is  also  necessary  for  a  processor  to  check 
the  global  lock  flag  once,  after  it  links  itself  at  the  end  of 
the  queue,  because  it  is  possible  that  all  the  processors  in 
the  queue  are  servicing  interrupts  and  the  global  lock  is  set. 
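The hand-off policy of the improved algorithm can be modeled sequentially as follows. This is a deliberately simplified sketch in Python: it ignores all concurrency and atomicity, represents the waiting queue as a list, and uses hypothetical names (`Waiter`, `release`). It only illustrates which processor receives the lock and when the global lock is set.

```python
from dataclasses import dataclass

LOCKED, PREEMPTED = "Locked", "Preempted"


@dataclass
class Waiter:
    name: str
    state: str = LOCKED   # PREEMPTED while servicing interrupts


def release(queue):
    """Model of one release; returns (next_holder, glock).

    The lock goes to the first waiter that is not servicing
    interrupts. Preempted waiters keep their queue positions.
    If every waiter is preempted, nobody receives the lock and
    glock is set to the top of the queue.
    """
    for index, waiter in enumerate(queue):
        if waiter.state == LOCKED:
            del queue[index]      # only the new holder leaves the queue
            return waiter, None
    glock = queue[0] if queue else None
    return None, glock
```

For example, with a queue of P1 (preempted), P2, and P3, a release passes the lock to P2 and leaves P1 at the head, so P1 resumes waiting in its original position after its interrupt service.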

Pseudo-code for the improved algorithm appears in
Figs. 1 and 2. In these figures, the keyword shared
indicates that only one instance of the variable is allocated
and shared in the system. Other variables are allocated
for each processor and located in its local memory. The
right-hand side of the and operator is assumed to be
evaluated only if its left-hand side is true. Fetch_and_store
reads the memory addressed by the first parameter, returns
the contents of the memory as its value, and atomically
writes the second parameter to the memory. CAS, the
abbreviation of compare_and_swap, first reads the memory
pointed to by the first parameter and compares its contents
with the second parameter. If they are equal, the function
writes the third parameter to the memory atomically and
returns true. Otherwise, it returns false.

    type qnode = record
        next, prev: pointer to qnode;
        locked: (Released, Locked, Preempted, Dequeueing)
    end;

    type lock = record
        last: pointer to qnode;
        glock: pointer to qnode
    end;

    // global shared data.
    shared var L: lock;
    // L.last and L.glock are initialized to NIL.

    procedure dequeue(entry, pred, top: pointer to qnode)
        var succ: pointer to qnode;
        succ := entry→next;
        if succ = NIL then
            pred→next := NIL;
            if CAS(&(L.last), entry, pred) then goto release end;
    #       repeat succ := entry→next until succ ≠ NIL
        end;
        pred→next := succ;
        succ→prev := pred;
    release:
        entry→next := top;
        entry→locked := Released
    end;

Fig. 1: Improved algorithm (1)

In this pseudo-code, the glock field of L serves both
as the global lock flag and as the variable to pass the top
processor of the waiting queue. An exponential backoff
scheme is adopted to get the global lock in this code
to reduce the number of shared-bus transactions. Two
constant parameters α and β should be tuned for each
target hardware and application.
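To make the backoff schedule concrete, the sketch below replays the countdown logic of the waiting loop (the variables `i` and `interval`) after a processor returns from an interrupt service, and reports at which loop iterations the global lock would be checked. It is an illustration only: the function name is hypothetical, and the parameter values in the usage note are arbitrary rather than values taken from the paper.

```python
def glock_check_iterations(alpha, beta, total_iterations):
    """Return the 1-based spin iterations at which L.glock is checked.

    Mirrors the countdown in the waiting loop: after an interrupt
    service, i is set to 1 and interval to alpha; each time i reaches
    zero the global lock is checked, i is reloaded from interval, and
    interval is multiplied by beta.
    """
    checks = []
    i, interval = 1, alpha
    for iteration in range(1, total_iterations + 1):
        i -= 1
        if i == 0:
            checks.append(iteration)
            i = interval
            interval *= beta
    return checks
```

With the illustrative values `alpha=2` and `beta=2`, `glock_check_iterations(2, 2, 20)` gives `[1, 3, 7, 15]`: the gaps between checks double each time, so shared-bus traffic from polling the global lock falls off quickly. With `beta=1` the checks are evenly spaced.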

Though there are two non-local spins (marked with
#) in this pseudo-code, both of them continue only during the
transient state after another processor writes the pointer
to its queue node to L.last (successful execution of the
fetch_and_store operation marked with (1)) and until it
writes a non-NIL value to the next field of its predecessor
(marked with (2)), and their effect is not significant.

We have adopted the MCS lock as the base algorithm
in this section. The FIFO version of Craig's algorithm [4]
can be extended similarly.

4  Performance  evaluation 

The effectiveness of the two queueing spin lock algorithms
with preemption, the original one in [1] (called
QL/P1 in this section) and the improved one presented
in Figs. 1 and 2 (QL/P2), is examined through performance
evaluation. The performance of the algorithms is
compared with the MCS lock without inhibiting interrupts
(QL/ei), the MCS lock with interrupts inhibited (QL/di),
and the test-and-set lock with preemption with constant
delay (T&S/P).²

        // local data (allocated for each processor).
        var I: qnode;
        var pred, succ, top: pointer to qnode;
        var interval, i: integer;

        I.next := NIL;
        disable_interrupts;
    (1) pred := fetch_and_store(&(L.last), &I);
        if pred = NIL then goto acquired end;
        // enqueue myself.
        I.prev := pred;
        I.locked := Locked;
    (2) pred→next := &I;
        i := 1;          // check the global lock once.
        interval := ∞;   // never expires.
        while (I.locked ≠ Released) do
            if interrupt_requested and
                    CAS(&(I.locked), Locked, Preempted) then
                enable_interrupts;
                // interrupt service.
                disable_interrupts;
                I.locked := Locked;
                i := 1;
                interval := α
            end;
            i := i - 1;
            if i = 0 then
                // check the global lock and try to get it if it is set.
                top := L.glock;
                if top ≠ NIL and CAS(&(L.glock), top, NIL) then
                    if top ≠ &I then dequeue(&I, I.prev, top) end;
                    goto acquired
                end;
                i := interval;
                interval := interval × β
            end
        end;

    acquired:
        //
        // critical section.
        //
        succ := I.next;
        if succ = NIL then
            // try to make the queue empty.
            if CAS(&(L.last), &I, NIL) then goto exit end;
            repeat succ := I.next until succ ≠ NIL
        end;
        // try to pass the lock to the successor.
        if CAS(&(succ→locked), Locked, Released) then goto exit end;
        top := succ;
        repeat
            pred := succ;
            succ := pred→next;
            if succ = NIL then
                // set the global lock.
                L.glock := top;
                // check if pred is really the last processor.
                if L.last = pred then goto exit end;
                // try to withdraw the global lock.
                if not CAS(&(L.glock), top, NIL) then goto exit end;
    #           repeat succ := pred→next until succ ≠ NIL
            end
        until CAS(&(succ→locked), Locked, Dequeueing);
        dequeue(succ, pred, top);
    exit:
        enable_interrupts;

Fig. 2: Improved algorithm (2)

        for i := 1 to NoLoop do
    (1)     acquire_lock_and_disable_interrupts;
            //
            // critical section.
            //
            release_lock;
    (2)     enable_interrupts;
            random_delay
        end;

Fig. 3: Measurement program skeleton

Evaluation  environment 

We have used a shared-bus multiprocessor system for
the evaluation. The shared bus is based on the VMEbus
specification, and each processor node consists of a
20 MHz Gmicro/200 microprocessor, which is rated at
approximately 10 MIPS, 1 MB of local memory, and some
interfaces. The local memory can be accessed from
other processors through the shared bus. No cache memory
is equipped. The program code and the data area for each
processor are placed in the local memory of the processor.
Global shared data (e.g. L in Fig. 1) is placed in the local
memory of the master processor, which does not execute
spin locks.

The Gmicro/200 microprocessor supports the compare_and_swap
instruction but not fetch_and_store. In our
experiments, the fetch_and_store operation was emulated
using the compare_and_swap instruction and a retry loop.
As the VMEbus has only four pairs of bus request/grant
lines, processors are classified into four classes by the bus
request line they use. The round-robin arbitration scheme
is adopted among classes and the static priority scheme is
applied among processors belonging to the same class.
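The emulation of fetch_and_store with compare_and_swap and a retry loop can be sketched as below. This is an assumption-laden illustration in Python, not the code used in the experiments: the `Word` class is hypothetical, and a `threading.Lock` stands in for the atomicity of the hardware instruction.

```python
import threading


class Word:
    """One word of shared memory with an emulated compare_and_swap."""

    def __init__(self, value=None):
        self._guard = threading.Lock()  # stands in for hardware atomicity
        self.value = value

    def compare_and_swap(self, expected, new):
        # Write `new` only if the word still holds `expected`.
        with self._guard:
            if self.value == expected:
                self.value = new
                return True
            return False


def fetch_and_store(word, new):
    """Atomically swap `new` into `word` and return the old contents."""
    while True:
        old = word.value                     # plain read
        if word.compare_and_swap(old, new):  # retry if someone intervened
            return old
```

The retry loop repeats only when another processor modifies the word between the read and the compare_and_swap, so the number of retries depends on contention for that word.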

Measurement  method 

Each processor executes the code presented in Fig. 3
while periodic interrupt requests are raised on the processor.
The execution time of a critical region (the region between
(1) and (2) in Fig. 3) is measured for each execution, and
its distributions when the processor services no interrupt
request during the region and when it services an interrupt
are collected. The interrupt latency is also measured for
each interrupt service and its distribution is obtained.

Inside the critical section, a processor accesses the
shared bus some number of times (for making the effect of
bus traffic explicit) and waits for a while using empty loops.
Without spin locks, the execution time of the critical region
is about 40 μs including some overhead for obtaining the
execution time of the region. In order to change timing
conditions, each processor waits for a random time before
it re-enters the critical region (random_delay in Fig. 3).
The average time of the random delay is about 40 μs.

² Past studies show that a test-and-set lock has good scalability with
exponential backoff [2]. However, because the lock acquisition time
varies widely with exponential backoff, it is inappropriate for real-time
systems. This conjecture was also confirmed through our experiments.

Empty loops are also included in the interrupt handler
in addition to the routine for obtaining interrupt latency
time. The total execution time of the interrupt handler is
about 80 μs. The period of interrupt requests is about 5 ms.
The exact length of the period is varied in 0-2% for each
processor.

Performance  metric 

In  real-time  systems,  the  effectiveness  of  algorithms 
should  not  be  evaluated  with  their  average  performance 
but  with  their  worst-case  execution  (or  response)  times. 
However,  in  the  case  of  spin  lock  algorithms,  worst-case 
times  cannot  be  obtained  through  experiments  because  of 
unavoidable  non-determinism  in  multiprocessor  systems. 
Therefore,  in  place  of  worst-case  times,  we  have  adopted 
p-reliable  times,  the  time  within  which  a  processor  finishes 
executing  a  critical  region  (or  responds  to  an  interrupt 
request)  with  probability  p,  as  a  performance  metric.  In  the 
following  section,  we  show  the  evaluation  results  when  p 
is  0.999  (i.e.  99.9%). 
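One way to compute a p-reliable time from measured samples is sketched below. The paper does not state how the value was extracted from the collected distributions, so the rule used here (the smallest observed time that covers at least a fraction p of the samples) is an assumption, and the function name is hypothetical.

```python
import math


def p_reliable_time(samples, p):
    """Smallest observed time t such that at least a fraction p
    of the samples are less than or equal to t."""
    if not samples:
        raise ValueError("no samples")
    if not 0 < p <= 1:
        raise ValueError("p must be in (0, 1]")
    ordered = sorted(samples)
    covered = math.ceil(p * len(ordered))  # samples that must be covered
    return ordered[covered - 1]
```

With 1000 samples and p = 0.999, this returns the second-largest sample, so a single outlier does not determine the metric, whereas p = 1 reduces to the observed maximum.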

Evaluation  results 

Fig. 4 presents the 99.9%-reliable execution time of
the critical region (when no interrupt is serviced on the
processor during the region) as the number of processors is
increased from one to eight. With QL/P1 and QL/P2, the
execution time of the critical region increases linearly with
the number of processors, and the algorithms are found to
be scalable. QL/ei exhibits poorer performance because
preceding processors service interrupt requests during the
critical region.

In Fig. 5, the interrupt latency time is nearly independent
of the number of processors with QL/P1 and QL/P2. With
QL/di, on the contrary, the interrupt latency becomes longer
as the number of processors increases.

From these observations, it is demonstrated that QL/P1
and QL/P2 can give a practical upper bound on the time to
acquire and release an interprocessor lock while achieving
fast response to interrupt requests. The other algorithms
cannot satisfy these two requirements at the same time.

The overall performance of QL/P2 is a little worse than
that of QL/P1, because the number of shared-bus transactions
is larger with QL/P2 and because a doubly linked queue is
necessary. The advantage of QL/P2 appears in Fig. 6, which
presents the 99.9%-reliable execution time of the critical
region when an interrupt is serviced during the region.
When the number of processors is large, the recovering
overhead from interrupt services is much smaller in QL/P2
than in QL/P1.


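Fig. 4: 99.9%-reliable exec. time of critical region
(when no interrupt is serviced; x-axis: number of processors)

Fig. 5: 99.9%-reliable interrupt latency
(x-axis: number of processors)

Fig. 6: 99.9%-reliable exec. time of critical region
(when an interrupt is serviced; x-axis: number of processors)

Fig. 7: Average exec. time of critical region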


Finally, in order to examine the average performance of
the algorithms, we present the average execution time of
the critical region (when no interrupt is serviced during the
region) in Fig. 7.

5  Conclusion 

Conventional spin lock algorithms cannot satisfy two
important requirements for real-time systems using asymmetric
shared-memory multiprocessors, predictable spin
locks and fast interrupt response, at the same time. In
this paper, we propose an improved spin lock algorithm
that can give an upper bound on the time to acquire and
release an interprocessor lock while realizing fast response
to interrupt requests. To evaluate the effectiveness of the
original and improved algorithms, we have measured their
performance through experiments and
confirmed that the algorithms have the required properties.

We are currently designing a real-time kernel specification
called ITRON-MP and implementing it experimentally
[6]. It remains as future work to adopt the algorithms in
the implementation and to evaluate the algorithms in real
applications.
Abstract

Continuous-media applications require more efficient and
flexible support from real-time threads than traditional
real-time systems. This support includes functionalities such as the
dynamic management of thread attributes and the support of
multiple thread models. In this paper, we describe the
design and implementation of user-level real-time threads
on the RT-Mach microkernel. Since they are implemented
at user level, both fast management of thread attributes
and support of multiple thread models are possible.

1  Introduction 

Continuous-media applications require more efficient and
flexible support from real-time threads than traditional
real-time systems [4, 12, 13]. The "flexible support" includes
the following two functionalities:

•  the  dynamic  management  of  thread  attributes, 

•  the  support  of  multiple  thread  models. 

The dynamic management of thread attributes is necessary
because system resource utilization in workstations and
network environments changes from minute to minute. Timing
attributes, such as start time, deadline and period, are part
of the thread attributes. For example, if there are too many
threads for a system to satisfy their timing requirements,
some threads may be able to run less frequently or with
shorter execution time. As another example, if network
traffic is heavy and an application cannot receive data at
the expected rate, threads of the application should change
their behavior to follow the rate of data received.

The ability to support multiple thread models is also important.
Since there is no standard way to implement
continuous-media applications, programmers may be able
to choose one of the existing thread packages or may want
to create a new one. For instance, one programmer may find
it useful to use a periodic thread to process continuous
data, while another programmer would like to use threads
which have their own start time and deadline and to create a new
thread for each data chunk.


Our goal is to realize high-performance user-level real-time
threads because only user-level real-time threads can
achieve the above functionalities. Since they are implemented
at user level, both fast management of thread
attributes and support of multiple thread models are
possible. The next section describes the previous work.
Section 3 discusses design issues of user-level real-time
threads, and Section 4 proposes a software architecture for
user-level real-time threads. Section 5 describes the current
status with some performance figures, and Section 6 gives
the conclusion.

2  Previous  Work 

Real-time threads have been developed as kernel entities.
Existing real-time kernels, such as ARTS [10] and
RT-Mach [11], realize their real-time threads as kernel-provided
threads. Since threads are implemented in the
kernel, primitives like real-time synchronization and functions
to set thread attributes are also implemented in the
kernel. Thus, thread operations are so expensive that
the performance is sometimes unacceptable for dynamic
environments requiring the dynamic management of thread
attributes [13].¹

First-class user-level threads were developed to solve
scheduling problems that occur in user-level thread environments,
where an entire task is blocked when a user-level
thread is blocked in the kernel. Scheduler Activations [1]
and the first-class user-level threads of the Psyche operating
system [7] provide mechanisms to avoid the above
problem. Both of them are implemented on parallel
computers to exploit the parallelism of the underlying
hardware. Thus, they have no functionality to manage
timing attributes of threads.

Split-level scheduling [4] provides user-level real-time
threads through shared user/kernel structures with
split kernel-level and user-level schedulers. This is implemented
on a uniprocessor, and shared memory is extensively
used to pass information between a user-level
scheduler and the kernel. Each user-level thread has its
logical arrival time and deadline, and the threads are scheduled
by the deadline/workahead scheduling policy based on
their timing attributes. Split-level scheduling proposes a
new mechanism for asynchronous communication to avoid
threads being blocked in the kernel. Since split-level scheduling
was developed to handle continuous media efficiently,
its goal is similar to ours. However, it does not have a notion
of dynamic rebinding of timing attributes. The timing attributes
of threads are managed by the split kernel-level and
user-level schedulers cooperatively, while our user-level
real-time threads manage timing attributes of threads using
a timer which is a separate instance from a thread. This
feature increases the flexibility of user-level schedulers.

¹ In our previous experience with ARTS [8], context switching in
the same address space costs 3 μsec for user-level threads and 26 μsec
for kernel-provided threads. Synchronization costs 9 μsec for user-level
threads and 46 μsec for kernel-provided threads. Since the dynamic
management of thread attributes introduces many operations in the same
address space (described in Section 4.3), this feature is [illegible].


3  Design  Issues 

User-level real-time threads must be treated as first-class
user-level threads [1, 4, 7] since user-level real-time threads
need to be scheduled as correctly as kernel-provided
threads. In this section, we first describe the design
decisions made to implement first-class user-level threads. Then,
several design issues are discussed.

3.1  First-Class  User-Level  Threads 

Mechanisms proposed by previous implementations of
first-class user-level threads [1, 4, 7] were examined. Then,
the following mechanisms were chosen for implementing
our user-level real-time thread model.

Upcall: The kernel has to notify a user-level scheduler
of events that occurred in the kernel and affect
a scheduling decision. The kernel upcalls a user-level
scheduler, and the user-level scheduler processes the events and
chooses the next thread to run. This mechanism is used in
all implementations of first-class user-level threads, although
they call it differently.

Shared kernel/user data structures: There are two
different approaches to passing events to a user-level scheduler.
Shared kernel/user data structures are used for first-class
user-level threads [7] and split-level scheduling [4].
Scheduler activations [1] upcall different entry points of a
user-level scheduler, each of which is provided for the
corresponding type of event. We chose to use shared kernel/user
data structures since they can also be used to pass information
about threads, such as priorities and timing attributes,
from user-level schedulers to the kernel. They also provide
a simple way to pass events asynchronously.
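As a rough illustration of how a shared kernel/user data structure could carry events asynchronously, the following C sketch models the shared region as a small ring buffer plus a priority word. The event kinds, the field names, and the single-producer/single-consumer discipline are assumptions made for this example, not details taken from the RT-Mach implementation. It also assumes a uniprocessor, so no memory barriers are shown.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical event kinds; the paper does not enumerate them. */
enum ev_kind { EV_THREAD_BLOCKED, EV_THREAD_UNBLOCKED, EV_TIMER_EXPIRED };

struct event {
    enum ev_kind kind;
    int thread_id;          /* user-level thread the event refers to */
};

#define EVQ_SIZE 8          /* must be a power of two */

/* Region mapped into both the kernel and the user-level scheduler. */
struct shared_area {
    struct event ring[EVQ_SIZE];
    unsigned head;          /* advanced only by the consumer (scheduler) */
    unsigned tail;          /* advanced only by the producer (kernel) */
    int current_priority;   /* written by the scheduler, read by the kernel */
};

/* Kernel side: post an event. Returns false when the ring is full. */
bool evq_post(struct shared_area *sa, struct event ev)
{
    assert(sa != NULL);
    if (sa->tail - sa->head == EVQ_SIZE)
        return false;
    sa->ring[sa->tail % EVQ_SIZE] = ev;
    sa->tail++;
    return true;
}

/* Scheduler side: fetch the oldest event. Returns false when empty. */
bool evq_take(struct shared_area *sa, struct event *out)
{
    assert(sa != NULL && out != NULL);
    if (sa->head == sa->tail)
        return false;
    *out = sa->ring[sa->head % EVQ_SIZE];
    sa->head++;
    return true;
}
```

Because each index is written by only one side, the kernel can post events without waiting for the scheduler, and the scheduler can drain them at its next upcall. The same region carries information in the opposite direction through `current_priority`.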


Creation of a new virtual processor: Scheduler activations
[1] create a new virtual processor when the current one
is blocked in the kernel, but the others do not. We chose to
create a new virtual processor. There are two main reasons
for this decision. One reason is that our platform, RT-Mach
[11], requires it. Virtual processors are implemented using
kernel-provided threads. Since there are many places
where a thread structure is referenced in the kernel, it is
too hard to modify them to cope with a user-level thread.
Another reason is that the number of interactions between
a user-level scheduler and the kernel can be reduced. For
example, when a thread is unblocked in the kernel, the
event is notified to a user-level scheduler. If the user-level
scheduler decides to run the unblocked thread, it issues a
system primitive to resume it. We can avoid such a heavy
interaction if the current virtual processor is preserved and
a new one is used to upcall a user-level scheduler.

3.2  Dynamic  Creation  of  Virtual  Processors 

The dynamic creation of a new virtual processor sometimes
takes a long time, and it can be a source of unpredictability.
If there is an extra virtual processor that is not in
use, it can be used instead of the current one, and
dynamic creation is not necessary. Therefore, when a
user-level scheduler initializes its state, it asks the kernel
to create several kernel-provided threads. Those threads
are maintained in the kernel, and are used later as virtual
processors when a running virtual processor is blocked in
the kernel.

The number of kernel-provided threads created at
initialization is fixed. If all of them are in use and blocked in the
kernel, the kernel needs to either create a new virtual processor
dynamically or leave it blocked. For hard real-time
applications, the behaviors are analyzed and the necessary
number of virtual processors is determined. For soft real-time
applications, dynamic management of the number of
virtual processors is necessary.

3.3  Priority Consistency

User-level real-time threads are managed and scheduled by
a user-level scheduler, while virtual processors are
scheduled by the kernel-level scheduler. User-level real-time
threads and virtual processors have their own priority data.
Thus, they are managed independently. Since user-level
threads are multiplexed on a virtual processor, the
priority of the current user-level thread must be reflected in
the priority of its virtual processor to schedule the virtual
processor correctly.

The problem that arises here is that the current
priority, which needs to be reflected in the virtual processor,
changes independently of the kernel because user-level
threads switch at user level. Therefore, a mechanism
that keeps the priority data of a virtual processor up to
date is necessary.

Figure 1: Blocking Thread in the Kernel

3.4  Timing Management

User-level real-time threads also have timing attributes such
as a start time and a deadline. Usually, timing
management is done in the kernel using a clock device
that interrupts the kernel at very short
intervals.^2 Since user-level real-time threads are managed by
a user-level scheduler, the user-level scheduler needs to
manage their timing attributes. This requires the user-level
scheduler to cooperate closely with the kernel.

A user-level scheduler needs to tell the kernel when it
would like to be notified. Since the dynamic management
of thread attributes requires fast rebinding of the timing
attributes, system primitives cost too much for this purpose. Thus,
a shared kernel/user data structure is used to share such
information. A user-level scheduler maintains timing data
in it, and the kernel checks it. When the time at which a user-level
scheduler needs a notification arrives, the kernel sends
an event to it.


4  Software Architecture

In this section, we first describe how virtual processors are
used and interact with user-level real-time threads. Then,
mechanisms for user-level timers and the dynamic
management of timing attributes are discussed.

4.1  Virtual  Processors 

There are the following three types of virtual processors:

•  A current virtual processor is currently executing
user-level threads in an address space. Only this type
of virtual processor can run at user level.

•  A kernel virtual processor is attached to a specific
user-level thread, which is blocked in the kernel. It
executes only in the kernel because user-level threads
running at user level must be multiplexed on the
current virtual processor.

•  A reserved virtual processor is waiting to become a
current one. One of them is used when a current one
is blocked in the kernel.

Figure 2: Unblocking Thread in the Kernel

^2 10 ms is a very common value for current workstations.

When a user-level thread is blocked in the kernel, the
current virtual processor, which is executing the blocked
thread, becomes a kernel virtual processor. Then, one of
the reserved virtual processors is taken from the list, and
becomes the current virtual processor. Finally, the new
current virtual processor upcalls the user-level scheduler. (See
Figure 1.)

When a kernel virtual processor is unblocked, it is
scheduled by the kernel-level scheduler independently of the
current virtual processor. When a kernel virtual processor is
about to exit the kernel, it passes two execution contexts to
the user-level scheduler. One is for the user-level thread on
the kernel virtual processor. The other is for the user-level
thread on the current virtual processor, which is preempted
by the kernel virtual processor. Then, the current virtual
processor is linked into the list of reserved virtual processors,
and the kernel virtual processor becomes the new current
virtual processor. Finally, the new current virtual processor
upcalls the user-level scheduler. (See Figure 2.)
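The two transitions described above (Figures 1 and 2) can be sketched as a small state machine in C. The structure layout, the function names `vp_on_block` and `vp_on_unblock`, and the choice to return NULL when the reserve is exhausted are hypothetical and serve only to make the role changes concrete; the upcall itself is left out.

```c
#include <assert.h>
#include <stddef.h>

enum vp_state { VP_CURRENT, VP_KERNEL, VP_RESERVED };

struct vp {
    enum vp_state state;
    int blocked_thread;   /* thread attached while VP_KERNEL, else -1 */
    struct vp *next;      /* link in the reserved list */
};

struct task {
    struct vp *current;   /* the single VP running at user level */
    struct vp *reserved;  /* list of pre-created, idle VPs */
};

/* The running user-level thread blocks in the kernel (Figure 1).
 * Returns the new current VP, which would then upcall the scheduler,
 * or NULL if no reserved VP is left (nothing is changed in that case). */
struct vp *vp_on_block(struct task *t, int thread_id)
{
    struct vp *fresh = t->reserved;
    if (fresh == NULL)
        return NULL;
    t->reserved = fresh->next;
    fresh->next = NULL;

    t->current->state = VP_KERNEL;
    t->current->blocked_thread = thread_id;

    fresh->state = VP_CURRENT;
    t->current = fresh;
    return fresh;
}

/* A kernel VP is about to leave the kernel (Figure 2): the preempted
 * current VP goes back to the reserved list and the kernel VP becomes
 * current. Returns the id of the thread that was unblocked. */
int vp_on_unblock(struct task *t, struct vp *kvp)
{
    assert(kvp->state == VP_KERNEL);
    int resumed = kvp->blocked_thread;

    struct vp *old = t->current;
    old->state = VP_RESERVED;
    old->next = t->reserved;
    t->reserved = old;

    kvp->state = VP_CURRENT;
    kvp->blocked_thread = -1;
    t->current = kvp;
    return resumed;
}
```

In this sketch exactly one virtual processor is in the `VP_CURRENT` state after either transition, which mirrors the rule that user-level threads are multiplexed on a single current virtual processor.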

4.1.1  Priority  Update 

To make the priority data of the current virtual processor
consistent with the priority of the current user-level thread,
it is updated in the following cases:

•  when an interrupt occurs,

•  when the kernel-level scheduler is invoked,

•  when a user-level thread woken up by a timer has a
higher priority than the current virtual processor.

At each interrupt, the priority data of the current
user-level thread is copied to the current virtual processor in the
current task. Then, the kernel checks whether the current virtual
processor has the highest priority. If it does not, the kernel
invokes the highest-priority kernel-provided thread.

When the kernel-level scheduler is invoked, the priority
data of the current user-level thread is copied if the current
kernel-provided thread is a virtual processor. This
avoids priority inconsistency when a user-level thread
switch occurs after an interrupt.

We discuss the priority update that is necessary when a
user-level thread is woken up by a timer in Section 4.2.2.
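The first two update cases can be pictured with the following C sketch. It assumes that a larger number means a higher priority and that the scheduler publishes the running thread's priority in a shared word; the names `shared_prio`, `kthread`, and `sync_vp_priority` are invented for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Word in the shared kernel/user area that the user-level scheduler
 * rewrites on every user-level context switch. */
struct shared_prio {
    int current_thread_prio;
};

/* Minimal view of a kernel-provided thread. */
struct kthread {
    int prio;
    bool is_virtual_processor;
};

/* Called at interrupt time and when the kernel-level scheduler runs.
 * Copies the user-level priority into the virtual processor, then
 * reports whether another kernel-provided thread should run instead. */
bool sync_vp_priority(struct kthread *cur, const struct shared_prio *sp,
                      int highest_ready_prio)
{
    assert(cur != NULL && sp != NULL);
    if (cur->is_virtual_processor)
        cur->prio = sp->current_thread_prio;
    return cur->prio < highest_ready_prio;
}
```

A kernel-provided thread that is not a virtual processor keeps its own priority, so existing kernel threads are unaffected by the copy.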

4.2  Timer

In RT-Mach, a kernel-provided timer called the RT-Mach Timer
[9] is already implemented in the kernel for kernel-provided
real-time threads. To use it for user-level real-time threads,
several modifications are necessary so that it can interact with
user-level schedulers.

It is possible to use kernel-provided timers with a few
modifications if one timer is used for each user-level
real-time thread, as is done for kernel-provided real-time threads. This
scheme, however, causes a lot of kernel interventions since
each operation on a timer requires issuing a system
primitive. The semantics of a timer are also limited since it
is implemented in the kernel. This makes it difficult to
achieve our goals.

In our architecture, a user-level scheduler employs a
single kernel-provided timer only for notification, and
manages user-level timers to decide what needs to be done
when notified.

4.2.1  User-Level Timer

User-level timers are managed by a user-level scheduler.
A user-level timer provides a kernel-provided timer with
the time and the priority data. The time specifies when the
user-level scheduler would like to get a notification. The
priority data is used by the kernel to update the priority data
of the current virtual processor. A kernel-provided timer
uses the above data of user-level timers to decide when
it notifies the user-level scheduler. Since the data of user-level
timers is written by a user-level scheduler and read by a
kernel-provided timer, it needs to be placed in a shared
kernel/user data structure.

Decoupling user-level threads and timers makes it
possible to support multiple thread models. The kernel just
notifies a user-level scheduler when it needs a notification.


This  mechanism  does  not  assume  any  model.  Thus,  user- 
level  schedulers  can  interpret  and  use  notifications  as  they 
wish. 
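A minimal C sketch of this split is given below. It assumes the scheduler keeps its user-level timers in a plain array and publishes only the earliest expiry, together with its priority data, in a slot of the shared area; all type and function names are hypothetical and the time unit is arbitrary.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* A user-level timer, private to the user-level scheduler. */
struct utimer {
    uint64_t expires;   /* absolute expiry time */
    int prio;           /* priority of the thread to wake */
    int thread_id;
    bool active;
};

/* Slot in the shared area, read by the single kernel-provided timer. */
struct shared_timer_slot {
    uint64_t notify_at;
    int prio;
    bool armed;
};

/* Scheduler side: publish the earliest active user-level timer. */
void utimer_publish(const struct utimer *timers, size_t n,
                    struct shared_timer_slot *slot)
{
    assert(slot != NULL);
    slot->armed = false;
    for (size_t i = 0; i < n; i++) {
        if (!timers[i].active)
            continue;
        if (!slot->armed || timers[i].expires < slot->notify_at) {
            slot->notify_at = timers[i].expires;
            slot->prio = timers[i].prio;
            slot->armed = true;
        }
    }
}

/* Kernel side: called on each clock interrupt; true means "notify". */
bool ktimer_should_notify(struct shared_timer_slot *slot, uint64_t now)
{
    if (!slot->armed || now < slot->notify_at)
        return false;
    slot->armed = false;
    return true;
}

/* Scheduler side, on notification: retire expired timers and return
 * how many threads became runnable. What to do with them is left to
 * the thread model. */
size_t utimer_expire(struct utimer *timers, size_t n, uint64_t now)
{
    size_t woken = 0;
    for (size_t i = 0; i < n; i++) {
        if (timers[i].active && timers[i].expires <= now) {
            timers[i].active = false;
            woken++;
        }
    }
    return woken;
}
```

Arming, cancelling, or rebinding a user-level timer touches only the array and the shared slot, so none of these operations needs a system primitive.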

4.2.2  Thread  Wakeup  by  Timer 

When a user-level thread that is woken up by a timer has
the highest priority, the current thread is preempted and the
woken thread must be invoked. This is the same case as
when a user-level thread is unblocked in the kernel. Thus,
the kernel performs the same operations on threads. If a woken
thread does not have the highest priority, the kernel just
notifies the user-level scheduler of the event.

4.3  Dynamic Management of Timing Attributes

The dynamic management of thread timing attributes is
achieved using deadline handlers and dynamic rebinding of
thread timing attributes [14].

A deadline handler is an independent thread that is
attached to a real-time thread. The deadline handler of a
real-time thread is invoked when the deadline of the
real-time thread is missed. The deadline handler can resume
the real-time thread to continue the rest of its work although
the deadline was missed, or it can abort the invocation if it is
meaningless to continue the work after the deadline.

When a system becomes overloaded and deadlines of
real-time threads start being missed, their deadline handlers
are invoked. In such a case, they can rebind the timing
attributes of the threads dynamically to reduce the system
load. Dynamic rebinding of thread timing attributes resets
timing attributes of a real-time thread, such as a period and
a deadline, to new values. The new values become valid
from the next invocation.

The  above  operations  are  all  processed  at  user-level. 
Thus,  user-level  real-time  threads  can  achieve  much  higher 
performance  than  kernel-provided  real-time  threads  since 
kernel interventions are not involved. A deadline handler is
an example of a mechanism for dynamic management.
It is very easy to add new features to a user-level scheduler.
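The interplay of deadline handlers and dynamic rebinding can be sketched in C as follows. For brevity the deadline handler is modelled as a callback rather than as the independent thread the design describes, and the `halve_rate` policy, the structure fields, and the microsecond unit are assumptions made for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct timing {
    uint64_t period;     /* microseconds, an assumed unit */
    uint64_t deadline;   /* relative to the start of an invocation */
};

struct rt_thread;
enum dl_action { DL_RESUME, DL_ABORT };
typedef enum dl_action (*deadline_handler)(struct rt_thread *);

struct rt_thread {
    struct timing active;    /* used by the current invocation */
    struct timing pending;   /* takes effect from the next invocation */
    int has_pending;
    unsigned missed;         /* number of missed deadlines so far */
    deadline_handler on_miss;
};

/* Dynamic rebinding: record new attributes without a system call. */
void rt_rebind(struct rt_thread *t, struct timing nt)
{
    assert(t != NULL);
    t->pending = nt;
    t->has_pending = 1;
}

/* Start of each periodic invocation: adopt rebound attributes. */
void rt_begin_invocation(struct rt_thread *t)
{
    if (t->has_pending) {
        t->active = t->pending;
        t->has_pending = 0;
    }
}

/* Called by the user-level scheduler when a deadline has passed. */
enum dl_action rt_deadline_missed(struct rt_thread *t)
{
    t->missed++;
    return t->on_miss ? t->on_miss(t) : DL_ABORT;
}

/* Example policy: run half as often from now on and drop this
 * invocation, which reduces the load of an overloaded system. */
enum dl_action halve_rate(struct rt_thread *t)
{
    struct timing nt = { t->active.period * 2, t->active.deadline * 2 };
    rt_rebind(t, nt);
    return DL_ABORT;
}
```

The rebound values sit in `pending` until `rt_begin_invocation` runs, which matches the rule that new timing attributes become valid from the next invocation.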

5  Current  Status 

We are currently implementing user-level real-time threads
on RT-Mach [11]. As our first implementation of a user-level
thread package, we decided to modify the C-Threads package
[3]. Since our implementation is upward compatible with the
original C-Threads package, applications using C-Threads
can also benefit from the high performance of first-class
user-level threads.

Table 1 shows the performance of signal/wait primitives
of our real-time version of C-Threads (RTC-Threads),^3 the
original C-Threads, and kernel-provided real-time threads
(RT Threads). The programs used to measure the
performance implement a producer/consumer model in which one
thread is a producer and another thread is a consumer. The
benchmarks were performed on a Gateway2000 486DX2
66MHz system. Table 2 shows the performance of basic
operations for comparison.

RTC-Threads      C-Threads        RT Threads
(user-level)     (user-level)     (kernel-provided)
25 µsec          38 µsec          170 µsec

Table 1: Signal/Wait Primitives

null function call    null system call (trap)    null system call (via MIG)
0.8 µsec              5 µsec                     72 µsec

Table 2: Basic Operations Performance

6  Summary 

The  goals  of  our  user-level  real-time  threads  are  the  dy¬ 
namic  management  of  thread  attributes  and  the  support  of 
multiple  thread  models.  We  showed  that  the  dynamic  man¬ 
agement  of  thread  attributes  can  be  achieved  by  realizing 
real-time  threads  at  user-level.  Introducing  the  user-level 
timer  mechanism  also  makes  the  support  of  multiple  thread 
models  possible. 

Our user-level real-time threads can also keep
compatibility with existing kernel-provided threads. They can
coexist  in  the  same  environment,  and  existing  applications 
still  run  without  any  modification. 

The  current  real-time  thread  model  is  being  imple¬ 
mented,  and  more  accurate  and  various  performance  mea¬ 
surements  will  be  completed. 

Acknowledgments 

We would like to thank members of the Multimedia Platform
Project for their various comments. We are also grateful
to Prof. Tatsuo Nakajima and Mr. Takuro Kitayama for
providing us with helpful information on RT-Mach.

References 


[1]  T.E. Anderson, B.N. Bershad, E.D. Lazowska, and H.M.
Levy. Scheduler Activations: Effective Kernel Support for
the User-Level Management of Parallelism. In Proceedings
of the 13th Symposium on Operating Systems Principles,
October 1991.

[2]  P. Barton-Davis, D. McNamee, R. Vaswani, and E.D.
Lazowska. Adding Scheduler Activations to Mach 3.0. In
Proceedings of the USENIX Mach 3rd Symposium, April
1993.

[3]  E.C. Cooper and R.P. Draves. C Threads. Technical Report
CMU-CS-88-154, School of Computer Science, Carnegie
Mellon University, February 1988.

[4]  R. Govindan and D.P. Anderson. Scheduling and IPC
Mechanisms for Continuous Media. In Proceedings of the 13th
Symposium on Operating Systems Principles, October 1991.

[5]  D. Golub, R. Dean, A. Forin, and R. Rashid. Unix as an
Application Program. In Proceedings of the Usenix Summer
Conference, June 1990.

[6]  R.G. Herrtwich. The Role of Performance, Scheduling, and
Resource Reservation in Multimedia Systems. In Proceedings
of International Workshop of Operating Systems of the
90s and Beyond, Lecture Notes in Computer Science 563,
Springer-Verlag, 1991.

[7]  B.D. Marsh, M.L. Scott, T.J. LeBlanc, and E.P. Markatos.
First-Class User-Level Threads. In Proceedings of the 13th
Symposium on Operating Systems Principles, October 1991.

[8]  S. Oikawa and H. Tokuda. User-Level Real-Time Threads:
An Approach towards High Performance Multimedia
Threads. In Proceedings of the 4th International Workshop
on Network and Operating System Support for Digital
Audio and Video, November 1993.

[9]  S. Savage and H. Tokuda. RT-Mach Timers: Exporting
Time to the User. In Proceedings of the USENIX Mach 3rd
Symposium, April 1993.

[10]  H. Tokuda and C.W. Mercer. ARTS: A Distributed
Real-Time Kernel. ACM Operating Systems Review, Vol. 23, No.
3, 1989.

[11]  H. Tokuda, T. Nakajima, and P. Rao. Real-Time Mach:
Towards a Predictable Real-Time System. In Proceedings
of USENIX Mach Workshop, October 1990.

[12]  H. Tokuda, Y. Tobe, S.T.-C. Chou, and J.M.F. Moura.
Continuous Media Communication with Dynamic QOS Control
Using ARTS with an FDDI Network. In Proceedings of
ACM SIGCOMM '92, August 1992.

[13]  H. Tokuda and T. Kitayama. Dynamic QOS Control based
on Real-Time Threads. In Proceedings of the 4th International
Workshop on Network and Operating System Support
for Digital Audio and Video, November 1993.

[14]  H. Tokuda, S. Savage and C.W. Mercer. A Real-Time
Thread Model for Continuous Media Applications. In
Preparation.

^3 This version of RTC-Threads does not have real-time facilities yet.




Experience  with  a  Prototype  of  the  POSIX 
“Minimal  Realtime  System  Profile” 


T.P.  Baker,  Frank  Mueller,  Viresh  Rustagi* 
Department  of  Computer  Science 
Florida  State  University 
Tallahassee,  FL  32304-4019 


Abstract 

This paper describes experience prototyping the
proposed IEEE standard “minimal realtime system
profile”, whose primary component is support for
realtime threads. It provides some background, describes
the implementation, and reports preliminary
performance measurements.

1  Introduction 

A thread is an independent sequential flow of
control. Threads differ from processes by sharing
a common virtual address space with other threads.
Threads are widely accepted as a computational
building block for both uniprocessor and multiprocessor
environments. In uniprocessor environments, the thread
model simplifies the programming of asynchronous
operations. In multiprocessor environments, threads may
also allow higher throughput, by utilizing more than
one processor.

The idea of cheap concurrency or “lightweight
processes” has been around in various forms for a long
time, including support for coroutines in the Mesa
programming language[13], and multitasking in the Ada
programming language[18]. The Pthreads (POSIX
Threads) proposal is intended to provide similar
functionality for programs in the C language. It is based on
considerable experience, including C-threads [2], Mach
threads[16, 17], and Brown University threads [3].
Several commercial operating systems support
multithreaded processes, including the Lynx[4], Sun[12, 14],
and Chorus[1] operating systems.

The  POSIX  1003.4a  project[8]  represents  an  at¬ 
tempt  to  achieve  some  degree  of  application  portability 
for  C  programs,  across  operating  systems  that  support 
threads.  This  is  an  extension  of  the  POSIX  application 
program  interface,  which  generally  follows  the  UNIX 
process  model. 

Threads are considered a “real time” extension to
POSIX. IEEE draft standard P1003.13[9] proposes a
set of realtime application profiles, i.e. subsets of the
POSIX standard that are suitable for certain classes
of realtime applications. Threads are a key feature of
these profiles. In particular, the “Minimal Realtime
System Profile” assumes a single process, with threads
being the only form of concurrency within the system.
The underlying hypothesis is that by not requiring
support for the more complex POSIX features, the profile
permits an implementation that will be satisfactory for
realtime applications with very tight efficiency and
timing predictability requirements.

The POSIX proposals are likely to have an impact
on future realtime applications development, since they
are being promoted as both U.S. Government and
international (ISO/IEC) standards.

The  draft  Pthreads  standard  specifies  the  follow¬ 
ing  services: 

•  thread  management:  initializing,  creating,  joining, 
and  exiting  threads. 

•  synchronization:  mutual  exclusion,  and  condition 
variables. 

•  thread-specific data: data maintained on a per-thread
basis.

•  thread  priority  scheduling:  priority  management, 
preemptive  priority  scheduling,  bounded  priority 
inversion. 

•  signals:  signal  handlers,  asynchronous  wait,  mask¬ 
ing  of  signals,  long  jumps. 

•  cancellation: cleanup handlers, different
interruptibility states.
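To give a feel for the thread management and synchronization services in this list, here is a short C sketch written against the Pthreads interface as it appears in the later, approved POSIX standard. The Draft 6 interface discussed in this paper differs in some details, so the calls below are an illustration rather than the draft's exact API. The names `worker` and `run_workers` and the worker and iteration counts are made up for the example.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

#define WORKERS    4
#define INCREMENTS 1000

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static long counter;

/* Each worker increments the shared counter under mutual exclusion. */
static void *worker(void *arg)
{
    (void)arg;
    for (int i = 0; i < INCREMENTS; i++) {
        pthread_mutex_lock(&lock);
        counter++;
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}

/* Creates the workers, joins them, and returns the final count,
 * or -1 if a thread could not be created or joined. */
long run_workers(void)
{
    pthread_t tid[WORKERS];

    counter = 0;
    for (int i = 0; i < WORKERS; i++)
        if (pthread_create(&tid[i], NULL, worker, NULL) != 0)
            return -1;
    for (int i = 0; i < WORKERS; i++)
        if (pthread_join(tid[i], NULL) != 0)
            return -1;

    /* The mutex guarantees that no increment was lost. */
    assert(counter == (long)WORKERS * INCREMENTS);
    return counter;
}
```

On most systems this is built with the compiler's thread option (for example `-pthread`), and `run_workers()` then returns 4000.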

2  Relationship  to  Ada  Tasks 

The Ada programming language[18] defines tasks
as the only form of concurrent threads of control within
a program. If the underlying operating system provides
direct support for the POSIX (C-language) threads
interface, it may be desirable to implement Ada tasks
using this interface, by mapping Ada tasks to POSIX
threads. Due to differences between the Ada and
POSIX/C models, this mapping is not entirely
straightforward.

The PART (POSIX Ada Real Time) project, at
the Florida State University, is investigating the
practicality of using POSIX threads to implement Ada
tasks, especially in realtime applications. So far, a
complete tasking implementation has been produced
for the Ada 83 standard, using an implementation of
P1003.4a Draft 6 layered over the Sun UNIX operating
system[5]. Work is under way to extend this to the
proposed new Ada 9X language standard [6, 19].
3  A “Bare Machine” Implementation

To evaluate the suitability of a Pthreads-based
Ada  implementation  for  realtime  applications,  one 
must  start  out  with  a  suitable  realtime  implementa¬ 
tion  of  Pthreads.  Experience  with  the  FSU  layered  im¬ 
plementation,  and  other  Ada  tasking  implementations, 
makes  it  clear  that  acceptable  realtime  performance 
is  not  achievable  for  an  implementation  layered  over  a 
conventional  UNIX  operating  system.  Efficiency  is  cer¬ 
tainly  an  issue,  but  the  main  problem  is  predictability. 

Most UNIX implementations impose unpredictable
delays on user processes, due to preemptions
by  interrupt  handlers  and  operating  system  processes. 
For  an  operating  system  to  provide  predictable  timing 
of  user  threads,  it  must  be  designed  with  this  objective 
in  mind,  from  the  hardware  up.  Some  commercial  real¬ 
time  operating  systems,  such  as  LynxOS  and  Chorus, 
apparently  have  been  designed  in  this  way. 

As  a  basis  for  performance  testing  of  our  Ada  9X 
implementation,  we  chose  to  port  our  existing  layered 
implementation  of  Pthreads  to  a  “bare”  SPARCengine 
1E[15].  We  chose  to  do  this  rather  than  using  an  exist¬ 
ing  commercial  realtime  OS,  for  many  reasons.  Chief 
among  these  is  that  we  needed  source  code,  to  tune  the 
threads  implementation  to  better  support  Ada  (if  nec¬ 
essary),  and  to  take  control  over  interrupts.  We  also 
were  concerned  that  the  commercial  implementations 
of  threads  might  be  too  full-blown  to  take  advantage 
of  the  restrictions  of  the  POSIX  minimal  profile,  since 
they  support  multiple  processes,  file  systems,  and  a  va¬ 
riety  of  hardware  devices.  (Finally,  there  was  concern 
that  licensing  restrictions  would  stand  in  the  way  of 
publication.) 

The  SPARCengine  port  of  the  FSU  Pthreads  li¬ 
brary  is  intended  to  fit  the  POSIX  minimal  realtime 
systems  profile.  It  runs  on  a  “bare”  machine,  without 
any other operating system. It does not support multiple
processes, and so operates in a single virtual address
space.  It  does  not  have  a  file  system.  At  present  the 
only  devices  supported  are  a  serial  port  and  a  timer. 
These  simplifications  eliminate  unpredictable  time  de¬ 
lays  due  to  page  faults,  waiting  for  completion  of  I/O, 
and  I/O  completion  interrupt  processing. 

The  scope  of  this  prototype  implementation  is  lim¬ 
ited  to  a  subset  of  the  proposed  POSIX  minimal  real¬ 
time  system  profile.  The  criteria  that  governed  the 
choice  of  this  subset  are: 

•  The  implementation  should  be  powerful  enough  to 
allow  testing  in  a  realtime  context.  This  requires 
the  following  functionalities: 

-  Dynamic creation and termination of threads
-  Synchronization primitives
-  A readable realtime clock
-  Timer support sufficient for periodic task scheduling
-  Output routines to print results

•  The  implementation  should  provide  sufficient 
functionality  to  implement  Ada  tasking. 

The  design  of  the  implementation  is  divisible  into 
three  main  components: 


1.  Pthreads  support.  This  implements  the  de¬ 
tailed  functionality  of  the  Pthreads  standard,  in¬ 
cluding  the  dynamic  creation  and  termination 
of  threads,  the  synchronization  primitives,  and 
thread  scheduling.  It  is  the  largest  component, 
but  can  be  identical  to  a  library  implementation. 

2.  Machine-specific  support.  This  includes  code  to 
save  and  restore  register  windows  for  context- 
switches,  boot  up  the  kernel,  and  provide  time¬ 
keeping  services.  The  boot  code  involves  initializ¬ 
ing  memory  mapping  hardware  and  installing  trap 
handlers.  With  the  library  implementation,  all  of 
these  functions  are  performed  by  the  underlying 
operating  system.  A  bare-machine  implementa¬ 
tion  must  perform  these  functions  for  itself. 

3.  C language support, including basic I/O and memory
allocation. These are functions provided by
the  standard  C  libraries,  but  the  standard  Sun 
Microsystems  implementation  of  the  C  libraries 
makes  calls  to  the  operating  system.  Without  the 
support  of  the  operating  system,  these  libraries 
need  to  be  reimplemented. 

The  design  of  the  Pthreads  functionality  was  con¬ 
strained  to  be  async  safe.  A  function  is  async  safe 
if  calling  the  function  asynchronously  will  not  cause 
any  invariants  to  be  violated,  even  if  it  is  called  from 
the  handler  of  an  interrupt  that  may  be  delivered  at 
any  time  [7].  Even  though  POSIX  does  not  require  the 
Pthreads functions to be async safe, we chose to require
it  as  a  matter  of  quality.  Async  safety  allows  a  user  to 
build  more  responsive  realtime  systems.  Furthermore, 
it  is  required  to  support  Ada  9X. 

A single-threaded kernel approach was used. Once
a thread has entered the kernel, no other thread can enter
the kernel until that thread has left it. The alternative,
a multi-threaded kernel, where separate locks are
associated  with  different  kernel  data  structures,  would 
allow  more  concurrency  in  a  multiprocessor  environ¬ 
ment.  For  this  to  pay  off,  the  cost  of  interprocessor 
locking  must  be  low,  relative  to  the  time  typically  spent 
in  kernel.  Since  we  have  only  a  single  processor,  the 
choice  was  clear;  the  overhead  of  fine-grained  locking 
would  result  in  poorer  performance. 
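One common way to realize a single-threaded kernel on a uniprocessor is a kernel flag: the flag is set on entry, and an interrupt that arrives while it is set is only recorded and handled when the kernel is left. The C sketch below shows that idea under the stated assumption; the paper does not spell out this mechanism at this point, and the function names are hypothetical. A real implementation would also have to close the small window between draining deferred work and clearing the flag, for example by masking interrupts briefly.

```c
#include <assert.h>
#include <stdbool.h>

static volatile bool in_kernel;   /* set while a thread is in the kernel */
static volatile int deferred;     /* interrupts recorded meanwhile */
static int handled;               /* interrupts actually processed */

/* Stand-in for the real work of an interrupt handler. */
static void handle_interrupt(void)
{
    handled++;
}

/* Enter the kernel; nesting is not allowed in a single-threaded kernel. */
void kernel_enter(void)
{
    assert(!in_kernel);
    in_kernel = true;
}

/* Interrupt entry point: defer the work if the kernel is busy. */
void interrupt_arrived(void)
{
    if (in_kernel) {
        deferred++;
        return;
    }
    in_kernel = true;
    handle_interrupt();
    in_kernel = false;
}

/* Leave the kernel, first processing whatever was deferred. */
void kernel_exit(void)
{
    while (deferred > 0) {
        deferred--;
        handle_interrupt();
    }
    in_kernel = false;
}

/* Accessor used to observe the sketch's behavior. */
int interrupts_handled(void)
{
    return handled;
}
```

With a single flag there is nothing to lock or unlock per data structure, which is the low overhead the single-processor argument above relies on.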

The  source  code  of  our  bare-machine  Pthreads  ker¬ 
nel  consists  of  approximately  3300  lines  of  C-code,  of 
which  approximately  1000  lines  are  new  for  the  bare- 
machine  version  and  the  rest  is  reused  from  the  library 
level implementation. The core image of the kernel is
49  kilobytes,  as  compared  to  984  kilobytes  for  the  full 
Sun  UNIX  kernel. 

Reuse  of  most  of  the  code  from  the  layered  FSU 
Pthreads  library  permits  direct  performance  compar¬ 
isons  of  the  two  implementations.  Differences  can  be 
attributed  to  running  on  a  bare  machine,  versus  as  a 
layer  over  the  UNIX  operating  system. 

This  code  was  tested  for  both  functionality  and 
performance. 

Functional  Testing  Functional  testing  was  done  us¬ 
ing  a  set  of  25  tests,  derived  from  tests  originally  devel¬ 
oped  to  test  the  layered  version  of  the  FSU  Pthreads 
library.  The  features  tested  by  these  bare-machine  tests 
include:


•  Thread  management  -  creation,  termination,  join, 
detach 

•  Priority  scheduling 

•  Mutexes  -  with  and  without  priority  ceilings 

•  Creating  and  destroying  condition  variables 

•  Timed  conditional  wait 

•  Thread  specific  data 

•  Setjmp/longjmp 

•  Signal  handlers 

•  Cancellation  and  cleanup  handlers 

It  was  verified  that  the  implementation  could  pass 
these  tests,  before  performance  testing  began. 

4  Absolute  Performance  Results 

The performance tests were also derived from tests developed earlier
for the layered version of the Pthreads library. These tests attempt to
measure the specific performance metrics called out by Draft 6 of
Pthreads.

Table 1 shows selected measurements of some of these metrics. The test
programs use a dual-loop timing analysis technique. The times reported
are averages taken over 100,000 iterations. These measurements are
compared to measurements taken earlier with the version of the Pthreads
library layered over UNIX, on the same machine.
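The dual-loop technique can be sketched as follows. This is a minimal illustration rather than the authors' test harness: the helper names (`now_us`, `net_per_iteration_us`, `dual_loop_us`) and the use of `clock_gettime(CLOCK_MONOTONIC)` are assumptions. An empty control loop is timed first, and its total is subtracted from that of a loop containing the operation, so that the loop overhead cancels out.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <time.h>

/* Current time in microseconds (assumes a POSIX monotonic clock). */
static double now_us(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec * 1e6 + ts.tv_nsec / 1e3;
}

/* Net cost per iteration: (loop with operation - empty loop) / iterations. */
double net_per_iteration_us(double with_op_us, double empty_us, long iterations)
{
    assert(iterations > 0);
    return (with_op_us - empty_us) / (double)iterations;
}

/* Time `op` with the dual-loop technique; returns the average cost in microseconds. */
double dual_loop_us(void (*op)(void), long iterations)
{
    volatile long i;  /* volatile keeps the empty loop from being optimized away */
    double t0, t1, t2;

    t0 = now_us();
    for (i = 0; i < iterations; i++)
        ;             /* control loop: loop overhead only */
    t1 = now_us();
    for (i = 0; i < iterations; i++)
        op();         /* measured loop: loop overhead plus the operation */
    t2 = now_us();

    return net_per_iteration_us(t2 - t1, t1 - t0, iterations);
}
```

For instance, if 100,000 iterations take 500,000 microseconds with the operation and 100,000 microseconds without it, the net cost is 4 microseconds per operation.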

For the layered implementation, the time taken for 100,000 iterations
of an operation ranges from about 100 milliseconds to 1 second. Though
there was only one user process active, this is long enough that a
system process might preempt it, so the numbers shown here for the
layered implementation may be a bit high.

The  metrics  include: 

•  Enter  and  exit  Pthreads  kernel.  This  is  the  time 
taken  to  enter  and  immediately  exit  the  kernel. 

•  Mutex lock/unlock, no contention. This is the time to perform a pair
of mutex lock and unlock operations, under the assumption that the
mutex is requested while unlocked.

•  Mutex lock/unlock, contention. This is the interval between an
unlock by one thread and the return from a lock operation by another
thread, which was suspended waiting for the mutex.

•  Semaphore synchronization. This is one Dijkstra P operation plus one
V operation. These are implemented on top of mutexes and condition
variables.

•  Thread create, no context switch. This measures the time taken to
create a thread, excluding the context switch time.

•  setjmp/longjmp pair. This is the time taken by a setjmp followed by
a longjmp. The performance of a pair of setjmp and longjmp operations
gives a lower bound on the overhead of a context switch, but a true
context switch involves some additional overhead.

•  Thread context switch. This is the time taken for a context switch.

•  Yield (1 thread). This is the time taken by the yield operation when
there is only one thread in the system.

•  Yield (2 threads). In this case, there are two threads in the
system.

•  Thread signal handler. The measurements taken for signal handling
reflect the time from sending a signal, by pthread_kill, until the
signal is received.
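The semaphore metric above refers to P and V operations built on mutexes and condition variables. A minimal sketch of that construction is shown here; the type and function names (`csem_t`, `csem_init`, `csem_p`, `csem_v`) are hypothetical, and the code is an illustration of the general technique, not the benchmark's actual implementation.

```c
#include <assert.h>
#include <stddef.h>
#include <pthread.h>

/* Counting semaphore built from one mutex and one condition variable. */
typedef struct {
    pthread_mutex_t lock;
    pthread_cond_t  nonzero;  /* signalled whenever count is incremented */
    int             count;
} csem_t;

int csem_init(csem_t *s, int initial)
{
    assert(initial >= 0);
    s->count = initial;
    if (pthread_mutex_init(&s->lock, NULL) != 0) return -1;
    if (pthread_cond_init(&s->nonzero, NULL) != 0) return -1;
    return 0;
}

/* Dijkstra P: wait until count > 0, then decrement. */
void csem_p(csem_t *s)
{
    pthread_mutex_lock(&s->lock);
    while (s->count == 0)                 /* loop guards against spurious wakeups */
        pthread_cond_wait(&s->nonzero, &s->lock);
    s->count--;
    pthread_mutex_unlock(&s->lock);
}

/* Dijkstra V: increment and wake one waiter, if any. */
void csem_v(csem_t *s)
{
    pthread_mutex_lock(&s->lock);
    s->count++;
    pthread_cond_signal(&s->nonzero);
    pthread_mutex_unlock(&s->lock);
}
```

Each P or V involves at least one mutex lock/unlock pair plus a condition-variable operation, which is in line with the semaphore figure in Table 1 being larger than the uncontended mutex figure.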


Table 1. Performance of Some Pthreads Operations (timings in µsecs)

Pthreads Operation                   Bare Machine   Layered over UNIX
enter and exit Pthreads kernel             1                 1
mutex lock/unlock, no contention           3                 3
mutex lock/unlock, contention             44               114
semaphore synchronization                 60               103
thread create, no context switch          37               104
setjmp/longjmp pair                       16                49
thread context switch                     17                95
yield 1 operation                          1                 1
yield 2 operations                        33                70
thread signal handler                     55                92

5  Time  Predictability  Results 

From the design of the implementation, we expect our bare-machine
implementation to achieve predictable execution timing. The main cause
of large deviations from the priority preemptive scheduling model has
been eliminated, namely preemption of user threads due to scheduling of
other processes, including operating system processes. The precision of
timed wakeup events has also been improved, from ten down to one
millisecond. With these improvements, we believe that the
implementation of priority scheduling is strict enough that actual
schedulable utilization will be very close to the theoretical
predictions of schedulability analysis.

During  debugging,  we  have  already  observed  that 
the  timing  is  remarkably  consistent.  This  was  evident 
in  the  reproducibility  of  failures  due  to  race-condition 
problems  between  (earlier,  incorrect  versions  of)  the 
timer  interrupt  handler  and  the  rest  of  the  system. 

We are currently working on benchmarks to measure the predictability of
scheduling for actual task sets, to compare these against theoretical
schedulability models, and to estimate the amount of overhead
introduced by the Pthreads implementation.

The first test is based on a benchmark developed earlier for a
preliminary design of a minesweeper trainer system for the U.S. Naval
Coastal Systems Center. This consists of a set of six periodic threads,
comprising a realtime simulation. Each thread has three phases. In the
first phase, it reads the simulated state of other simulated subsystems
from a global database. In the second phase, it computes its own next
state. In the third phase, it updates the global database. The read and
update phases require locking the database, which is done via a single
Pthread mutex. This is shown in pseudo-code in Figure 1. The thread
periods, and the execution times of the three phases, are shown in
Table 2.

    for (;;) {
        pthread_mutex_lock(&shared_memory);
        input_data();
        pthread_mutex_unlock(&shared_memory);

        execute();

        pthread_mutex_lock(&shared_memory);
        output_data();
        pthread_mutex_unlock(&shared_memory);

        next_request[task] += period[task];
        if (next_request[task] >= simulation_time)
            break;

        /* suspend until next period */
        pthread_mutex_lock(&mutex[self]);
        do {
            pthread_cond_timedwait(&cond[self],
                &mutex[self], &next_request[self]);
            clock_gettime(CLOCK_REALTIME, &current_time);
        } while (next_request[task] > current_time);
        pthread_mutex_unlock(&mutex[self]);  /* release before next iteration */
    }

Figure 1: Task Simulation Algorithm
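Figure 1 is pseudo-code: in real C the wakeup time is a `struct timespec`, which cannot be advanced with `+=` or compared with `>`. The sketch below shows one way the "suspend until next period" step could be written against the POSIX interfaces. The helper names `timespec_add_ns` and `sleep_until` are hypothetical and not part of the benchmark.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <time.h>

/* Advance *t by ns nanoseconds, keeping tv_nsec in [0, 1e9). */
void timespec_add_ns(struct timespec *t, long ns)
{
    assert(ns >= 0);
    t->tv_sec  += ns / 1000000000L;
    t->tv_nsec += ns % 1000000000L;
    if (t->tv_nsec >= 1000000000L) {
        t->tv_sec  += 1;
        t->tv_nsec -= 1000000000L;
    }
}

/* Block the caller until the absolute CLOCK_REALTIME instant *wakeup.
   Returns 0 once the instant has passed, or an error code otherwise. */
int sleep_until(pthread_mutex_t *m, pthread_cond_t *c,
                const struct timespec *wakeup)
{
    int rc = 0;
    pthread_mutex_lock(m);
    while (rc == 0)   /* 0 means signalled or spurious wakeup: wait again */
        rc = pthread_cond_timedwait(c, m, wakeup);
    pthread_mutex_unlock(m);
    return rc == ETIMEDOUT ? 0 : rc;
}
```

A periodic thread would call `timespec_add_ns(&next, period_ns)` once per iteration and then `sleep_until(...)` on the result. Because the timeout is absolute, lateness in one period does not accumulate into the next.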


Task   Period [ms]   Input [µs]   Execute [ms]   Output [µs]   Util.
1           62.5         2.0         44.80           2.0        71%
2          125.0         0.3          0.05           0.3         0%
3          166.7         1.6         27.80           1.6        16%
4          250.0         8.0          0.11           8.0         0%
5          500.0         3.2          5.02           3.2         1%
6        1,000.0        24.0         10.36          24.0         1%

Table 2: Task Set


Such a task system should be suitable for schedulability analysis,
based on the Rate-Monotonic model. The objective of our benchmark is to
determine how close the actual performance comes to this model.

In the benchmark, a bisection method is used to compute the breakdown
utilization, at which the tasks can just barely be scheduled without
missing any deadlines. This is done by varying a linear scaling factor,
called load factor, which applies to the execution times of all phases
of all the tasks.
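The bisection over the load factor can be sketched as below. This is an illustration under stated assumptions, not the benchmark's code: `breakdown_load_factor` and `schedulable_fn` are hypothetical names, and `toy_schedulable` is a stand-in predicate whose 0.81 threshold is purely illustrative. In the real benchmark the predicate would run the task set at the given load factor and report whether any deadline was missed.

```c
#include <assert.h>

/* Returns nonzero if the task set meets all deadlines at this load factor. */
typedef int (*schedulable_fn)(double load_factor);

/* Bisect [lo, hi] for the largest load factor that is still schedulable.
   Requires: lo is schedulable, hi is not, and tol > 0. */
double breakdown_load_factor(schedulable_fn ok, double lo, double hi, double tol)
{
    assert(lo < hi && tol > 0.0);
    while (hi - lo > tol) {
        double mid = 0.5 * (lo + hi);
        if (ok(mid))
            lo = mid;  /* all deadlines met: try a heavier load */
        else
            hi = mid;  /* a deadline was missed: back off */
    }
    return lo;
}

/* Stand-in for a real benchmark run; the threshold is illustrative only. */
int toy_schedulable(double load_factor) { return load_factor <= 0.81; }
```

The search assumes the predicate is monotone in the load factor. On a system with timing noise that assumption can fail, which is why a measured trial may need to be repeated or the search restarted.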

The  benchmark  was  run  repeatedly  over  both 
UNIX  and  the  bare-machine  implementation  with  an 
initial  target  utilization  of  90%.  The  results  are  shown 
in  Figure  2. 
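As a cross-check of the 90% figure, the nominal utilization of the task set can be recomputed from Table 2. The sketch below assumes the input and output columns are in microseconds and the execute column in milliseconds, as the table headers indicate; the struct and function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

typedef struct {
    double period_ms, input_us, execute_ms, output_us;
} Task;

/* Values transcribed from Table 2. */
static const Task TASKS[6] = {
    {   62.5,  2.0, 44.80,  2.0 },
    {  125.0,  0.3,  0.05,  0.3 },
    {  166.7,  1.6, 27.80,  1.6 },
    {  250.0,  8.0,  0.11,  8.0 },
    {  500.0,  3.2,  5.02,  3.2 },
    { 1000.0, 24.0, 10.36, 24.0 },
};

/* Utilization of one task: busy time per period, scaled by the load factor. */
double task_utilization(const Task *t, double load_factor)
{
    double busy_ms = t->input_us / 1000.0 + t->execute_ms + t->output_us / 1000.0;
    return load_factor * busy_ms / t->period_ms;
}

/* Sum of the per-task utilizations. */
double total_utilization(const Task *tasks, size_t n, double load_factor)
{
    assert(load_factor >= 0.0);
    double u = 0.0;
    for (size_t i = 0; i < n; i++)
        u += task_utilization(&tasks[i], load_factor);
    return u;
}
```

With a load factor of 1.0 the total comes to roughly 0.905, dominated by tasks 1 and 3, which agrees with the stated initial target of 90%.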

Figure 2: System Utilization for Repeated Trials

It was observed that the timing of the benchmark over UNIX varies
considerably at times. The bisection sometimes failed on its first
iteration, thereby indicating that the utilization must be reduced from
the initial 90% to below 45%. The bisection would then proceed to
terminate at a utilization around 44%. At other times, the bisection
succeeded for a trial of a certain load factor, but when the same load
factor was tried again upon termination of the bisection, it resulted
in a failure. We adapted our algorithm to restart the bisection, with
the current load factor as the upper bound, upon these sporadic
failures.

The bare-machine implementation produced very predictable results
without any variation. The utilization of the benchmark was measured at
81%. The remaining 19% can be interpreted as the time consumed by the
bare-machine implementation of Pthreads. Under UNIX, the benchmark
utilization had its peak at 77% with a remaining 23% overhead due to
the operating system and the layered Pthreads implementation. The
smaller overhead of the bare-machine implementation can be attributed
to the performance improvements discussed in the last section.

The occasionally large variations in the utilization under UNIX and the
sporadic failures of the bisection algorithm seem to be due to
operating system activities which occur at unpredictable times. These
activities include process scheduling, CPU time accounting, and the
processing of ethernet messages*. The unpredictability of the UNIX
operating system limits its applicability for hard real-time systems.
Hard real-time applications may not be able to safely achieve a high
utilization under UNIX. A bare-machine implementation seems to permit a
higher utilization for hard real-time applications, providing both
predictability and an efficient use of the hardware.

*There was no local hard disk attached to the SPARCengine. The only
asynchronous activities were due to clock and ethernet interrupts.

6 Conclusions

We have implemented a sufficient subset of the Minimal Realtime System
Profile to permit performance testing. The implementation supports
preemptive priority scheduling, with a restricted form of priority
ceiling emulation for mutexes. It supports a realtime clock with
microsecond precision, and timed events with millisecond precision,
including the timeout for the wait operation on a condition variable.
Experience with this implementation suggests that Pthreads can be
implemented in a form that is suitable for realtime applications with
hard timing constraints.

The absolute performance figures are encouraging. The performance of
the bare-machine implementation is much better than that of the version
layered over a full UNIX system. Part of this improvement is due to our
algorithm for saving register windows to memory, which is different
from that used by the commercial UNIX operating system. The other big
contribution to the performance improvement is that our implementation
avoids most of the overhead of UNIX system calls. User code executes in
the same virtual address space as the kernel. This means kernel service
calls can be ordinary subprogram calls, or even in-line macro calls,
rather than traps. We also eliminate the overhead of demultiplexing
service requests in the UNIX system call trap handler. This improvement
seems specific to the minimal realtime systems profile. Running kernel
and user processes in the same virtual address space would be
unacceptable for a full POSIX implementation.

The experiments performed support the hypothesis that a bare-machine
implementation can achieve excellent predictability. This enables the
system to support a priori schedulability analysis, in contrast to
unpredictable systems such as UNIX.

Next, we plan to port the PART Ada runtime system implementation to the
bare-processor Pthreads implementation, and test both the absolute
speed and the timing predictability of Ada. This may require extending
the functionality of the present implementation in some respects.
Handlers need to be written for some traps that generate synchronous
signals. For example, a mem_address_not_aligned trap should be
processed to generate a SIGBUS for the current thread. The current C
library support also needs some extensions.

Efforts will be made to flesh out the implementation in other respects,
including support for timer-driven round-robin scheduling, and some
debugging support.

References 

[1] F. Armand, F. Herrmann, J. Lipkis, and M. Rozier, “Multi-threaded
Processes in CHORUS/MIX”, Proceedings of EUUG Conference (Spring 1990)
1-13.

[2] E. Cooper and R. Draves, “C threads”, TR CMU-CS-88-154, Carnegie
Mellon University, Dept. of CS (1988).

[3] T. Doeppner Jr., “A threads tutorial”, TR CS-87-06, Brown
University, Dept. of CS (1987).

[4] Bill O. Gallmeister and Chris Lanier, “Early experience with POSIX
1003.4 and POSIX 1003.4a”, IEEE Symposium on Real-Time Systems, IEEE
Computer Society (1991) 190-198.

[5] E.W. Giering and T.P. Baker, “Using POSIX threads to implement Ada
tasking: Description of work in progress”, TRI-Ada ’92 Proceedings
(Nov 1992) 518-529.

[6] E.W. Giering, Frank Mueller, and T.P. Baker, “Implementing Ada 9x
features using POSIX threads: Design issues”, TRI-Ada ’93 Proceedings,
ACM (Sep 1993) 214-228.

[7] IEEE Portable Applications Standards Committee, P1003.4a: Threads
Extension for Portable Operating Systems (Draft 6), IEEE (Feb 1992).

[8] IEEE Portable Applications Standards Committee, P1003.4a: Threads
Extension for Portable Operating Systems (Draft 8), IEEE (Oct 1993).

[9] IEEE Portable Applications Standards Committee, P1003.13:
Information Technology - Standardised Applications Environment Profile
- POSIX Realtime Application Support (AEP) (Draft 5) (Feb 1992).

[10] Frank Mueller, “Implementing POSIX threads under UNIX: Description
of work in progress”, Proceedings of the Second Software Engineering
Research Forum (Nov 1992) 253-261.

[11] Frank Mueller, “A library implementation of POSIX threads under
UNIX”, Proceedings of the USENIX Conference (Jan 1993) 29-41.

[12] M.L. Powell, S.R. Kleiman, S. Barton, D. Shah, D. Stein, and
M. Weeks, “SunOS Multi-thread Architecture”, USENIX (Winter 1991)
65-80.

[13] D. D. Redell et al., “Pilot: An operating system for a personal
computer”, Communications of the ACM, Vol. 23, No. 2 (Feb 1980).

[14] D. Stein and D. Shah, “Implementing lightweight threads”,
Proceedings of the USENIX Conference (Summer 1992) 1-10.

[15] SUN Microsystems, Inc., The SPARCengine 1E Card Family User’s
Manual, Part No: 800-8137-02 (Apr 1990).

[16] A. Tevanian, R. F. Rashid, D. B. Golub, D. L. Black, E. Cooper,
and M. W. Young, “MACH threads and the UNIX kernel: The battle for
control”, Proceedings of the USENIX Conference (Summer 1987) 185-197.

[17] Hideyuki Tokuda, Tatsuo Nakajima, and Prithvi Rao, “Real-Time
MACH: Towards a predictable real-time system”, USENIX MACH Workshop
(Oct 1990).

[18] U.S. Department of Defense, Military Standard Ada Programming
Language, ANSI/MIL-STD-1815A, Ada Joint Program Office (Jan 1983).

[19] Ada 9X Mapping/Revision Team, Ada 9X Reference Manual: Draft
Version 4.0, Intermetrics, Inc., 733 Concord Avenue, Cambridge,
Massachusetts 02138 (available by anonymous FTP from ajpo.sei.cmu.edu)
(Sep 1993).

Availability  of  Source  Code 

The  source  code  of  the  version  of  the  Pthreads 
library  layered  over  UNIX  is  available  via  anonymous 
ftp  from  ftp.cs.fsu.edu  (128.186.121.27),  in  the  file 
/pub/PART/pthreads.tar.Z.  Other  material  (related 
publications)  can  be  found  in  the  same  directory. 




Session  II: 
Scheduling  I 

Chair:  Ted  Baker 
Florida  State 


An End-to-End Approach to Schedule Tasks with Shared Resources
in Multiprocessor Systems

Jun  Sun  Riccardo  Bettati  Jane  W.-S.  Liu 
Department  of  Computer  Science 
University  of  Illinois,  Urbana-Champaign 
Urbana,  IL  61801 


Abstract 

In this paper we propose an end-to-end approach to scheduling tasks
that share resources in multiprocessor or distributed systems. In our
approach, each task is mapped into a chain of subtasks, depending on
its resource accesses. After each subtask is assigned a proper
priority, its worst-case response time can be bounded. Consequently the
worst-case response time of each task can be obtained, and the
schedulability of each task can be verified by comparing the worst-case
response time with its relative deadline.


1  Introduction 

Tasks in real-time systems often share resources, and semaphore-like
operations are necessary to guarantee their mutually exclusive access
to critical sections. A previous study shows that careless use of
semaphore operations can cause uncontrolled priority inversion, which
occurs when a high-priority task is blocked by some low-priority tasks
for an unpredictable amount of time [1]. We refer to the total length
of time a task is delayed by lower-priority tasks due to resource
contention as its blocking time. To ensure predictability, it is
imperative to bound the blocking time of each task, as shown in [2].
Several effective solutions have been proposed for single-processor
systems; two well-known examples are the Priority Ceiling Protocol
(PCP) [1] and the Stack Based Protocol (SBP) [3].

In multiprocessor and distributed systems, concurrency and distribution
complicate the resource contention problem. A task $T_i$ can be blocked
not only by a local task on the same processor due to local resource
contentions, but also by a remote task that needs some global resources
also needed by $T_i$. Rajkumar et al. [4] extended PCP for
single-processor systems to multiprocessor systems and provided an
initial solution for this problem. The extended protocol is known as
the Multiprocessor Priority Ceiling Protocol (MPCP). According to MPCP,
a resource needed by remote tasks on other processors is a global
resource, and the processor on which a global resource resides is
called its synchronization processor. When a task $T_i$ gains access to
a global resource, a Global Critical Section (GCS) server runs on the
resource's synchronization processor on behalf of $T_i$. On each
processor PCP is used to schedule both local tasks and GCS servers.
Consequently, for each task, the total blocking time due to both local
resource contention and global resource contention can be bounded, and
whether each task can meet its deadline can be determined based on this
blocking time by using the schedulability condition for the
single-processor PCP.

However, the performance of MPCP is sometimes poor, especially for
tasks on synchronization processors. One reason is that GCS servers on
each synchronization processor always have higher priorities than local
tasks. The priority inversion problem is reintroduced when a
high-priority local task is delayed by GCS servers executing on behalf
of lower-priority tasks.

In this paper we propose an end-to-end approach to scheduling tasks
with shared resources and to analyzing their schedulability in
multiprocessor systems. Section 2 gives an informal description of this
approach and compares and contrasts it with MPCP. Section 3 presents in
detail the procedure used in the end-to-end approach. Future work is
discussed in Section 4.


2  The End-to-End Scheduling Approach

From the viewpoint of end-to-end scheduling, a task that needs remote
resources is viewed as a chain of subtasks in the following way. Each
critical section associated with a remote resource is a subtask that
executes on the synchronization processor of the remote resource. A
segment that requires no resources or only local resources is also a
subtask, and this subtask executes on the local processor. Subtasks of
the same task collectively inherit the task's release time and
deadline, and they execute in turn. Specifically, if task $T_i$ has $n$
subtasks, subtask $T_{i,1}$ is ready for execution at the release time
of $T_i$, and subtask $T_{i,j}$ is ready for execution when subtask
$T_{i,j-1}$ completes, for $j = 2, 3, \ldots, n$. The last subtask
$T_{i,n}$ must complete by the deadline of $T_i$. If task $T_i$ is a
periodic task, this precedence relation holds for every instance of
$T_i$.

The precedence relation among the subtasks of each task can be easily
satisfied by using the phase-modification method proposed in [5]. Let
$c_{i,j}$ be the worst-case response time of $T_{i,j}$. According to
the phase-modification method, once we know $c_{i,k}$ for
$k = 1, 2, \ldots, j-1$, we postpone the phase of the subtask $T_{i,j}$
by $\sum_{k=1}^{j-1} c_{i,k}$. This modification allows us to enforce
the precedence relation between subtasks while treating the subtasks in
each task as if there is no precedence relation between them. We will
return to discuss how to bound the worst-case response times of
subtasks on each processor using the schedulability condition in [5],
provided that the subtasks are assigned fixed priorities and some
single-processor synchronization protocol is used to control priority
inversion. By summing up the worst-case response times of all its
subtasks, we can determine the worst-case response time of each task,
and therefore whether the task can meet its deadline.
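The phase computation can be illustrated with a short sketch. The function name `modified_phases` is hypothetical, and the phase is expressed as an offset from the release of the parent task: subtask $j$ is released after the sum of the response-time bounds of its predecessors.

```c
#include <assert.h>
#include <stddef.h>

/* Fill phase[j] with the offset of subtask j from the task's release,
   i.e. the sum of the response-time bounds c[0..j-1].
   Returns the sum of all bounds, which bounds the task's response time. */
double modified_phases(const double *c, size_t n, double *phase)
{
    assert(n > 0);
    double acc = 0.0;
    for (size_t j = 0; j < n; j++) {
        phase[j] = acc;   /* release subtask j once all predecessors are done */
        acc += c[j];
    }
    return acc;
}
```

For example, with illustrative bounds of 2, 6 and 2 for three subtasks, the offsets are 0, 2 and 8, and the bound on the whole task is 10.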

Similar to MPCP, we allow nested resource accesses. However, we impose
an additional restriction that all resources accessed in one nested
critical section must reside on the same processor. In other words,
accesses to resources on different processors cannot be nested. One
consequence of the end-to-end scheduling approach is that there is no
need to control the accesses to remote, global resources differently
from local resources. Each subtask that is a GCS server in the MPCP
model is local to its synchronization processor. All resource
contentions are resolved locally and separately on each processor.

Table 1 gives an example, Example 1. In the table, $T_i$ denotes a
task; column proc lists the processor $T_i$ is assigned to; the
priority column gives $T_i$'s priority; $p_i$ denotes $T_i$'s period;
and $\tau_i$ stands for $T_i$'s processing time. The smaller the
priority value, the higher $T_i$'s priority. The system in this example
has two processors, $P_1$ and $P_2$. There are two periodic tasks,
$T_1$ and $T_2$, and one resource $R$. The deadline for each task is
the end of its period. $T_1$ is assigned to $P_1$; $T_2$ and $R$ are on
$P_2$. The table lists the parameters of the tasks. Specifically, $T_1$
has three segments. The first and the last segments need no resource;
they are executed on $P_1$, each with processing time 2. The middle
segment requires the resource $R$; its processing time is 2. (The
notation $t(R)$ in the Segments column indicates that the segment is a
critical section that has duration $t$ and accesses the resource $R$.)
We note that the tasks cannot be scheduled according to MPCP. Since
$T_1$ needs to access $R$ on $P_2$, there is a GCS server running on
$P_2$ on behalf of $T_1$. This server has a higher priority than $T_2$.
Since the processing time for this server is as long as $T_2$'s period
and $T_2$ will be blocked by the GCS server whenever the server
executes, $T_2$ cannot meet its deadline.

Table 1: Example 1 - A Simple System


In the end-to-end scheduling model, task $T_1$ is divided into three
subtasks, $T_{1,1}$, $T_{1,2}$ and $T_{1,3}$. $T_{1,1}$ and $T_{1,3}$
execute on processor $P_1$ and need no resource, while $T_{1,2}$
executes on $P_2$ and needs resource $R$. $T_{1,1}$, $T_{1,2}$ and
$T_{1,3}$ are dependent; the $k$th instance of $T_{1,1}$ (i.e., the
instance of $T_{1,1}$ in its $k$th period) must complete before the
$k$th instance of $T_{1,2}$ can begin execution. Similarly, the $k$th
instance of $T_{1,3}$ cannot start execution until the $k$th instance
of $T_{1,2}$ completes. Table 2 shows the parameters of the subtasks.
$\tau_{i,j}$ is the processing time of subtask $T_{i,j}$, $f_{i,j}$
denotes the modified phase of $T_{i,j}$, and $\beta_{i,j}$ denotes the
blocking time $T_{i,j}$ can experience.

Table 2: Example 1 - Using the End-to-End Approach to Schedule the
Simple System


In this example, there is only one critical section, and therefore
there is no blocking. The priorities of the subtasks are assigned on a
rate-monotonic basis. We see that the worst-case response time $c_1$ of
the task $T_1$ is $c_{1,1} + c_{1,2} + c_{1,3} = 10$, which is less
than 20, and the worst-case response time of $T_2$ is 1, which is less
than 2. We can therefore conclude that the deadlines of both tasks are
always met.
3  Schedulability Analysis

We now describe how to choose the priorities for subtasks and determine
their worst-case response times. We confine our attention to the case
where tasks are periodic and their subtasks are assigned fixed
priorities. However, the subtasks of each task may be assigned
different priorities.

Figure 1 gives the pseudo-code description of the end-to-end scheduling
procedure.

Input:

1. Task set $\{T_i\}$. For each task $T_i$, the deadline $D_i$, period
$p_i$, processing time $\tau_i$, and resource accesses;

2. The task assignment mapping task set $\{T_i\}$ to processor set
$\{P_k\}$;

3. The resource set $\{R_j\}$ and the resource assignment mapping
$\{R_j\}$ to $\{P_k\}$.

Output: The conclusion whether the system can be scheduled and the
priorities assigned to subtasks on each processor in the case the
system is schedulable.

Step 1: Map the given task set $\{T_i\}$ to an end-to-end task set
$\{T_{i,j}\}$.

Step 2: Assign priorities to subtasks.

Step 3: Obtain the worst-case response time for each subtask.

Step 4: Based on the results obtained in Step 3, analyze the
schedulability for the whole system.

Figure 1: Pseudo-Code of the End-to-End Scheduling Procedure

Step 1: Map the given task set to an end-to-end task set

Following the rules below, Step 1 breaks up each task $T_i$ in the
given task set $\{T_i\}$ into a chain of $n_i$ subtasks $T_{i,j}$ in
the corresponding end-to-end task set $\{T_{i,j}\}$:

1. Each subtask $T_{i,j}$ is either a critical section that requires
some remote resources or a segment that requires no resource or only
local resources. If a task has nested resource accesses, each outermost
critical section is mapped to a subtask.

2. A subtask that requires no resource or only local resources is on
the local processor of $T_i$. A subtask that requires remote resources
is on the synchronization processor of the remote resources.

3. For every $j = 1, 2, \ldots, n_i - 1$, consecutive subtasks
$T_{i,j}$ and $T_{i,j+1}$ are on different processors.

Rule 3 is not necessary for the correctness of the later discussion.
However, it allows us to obtain a tighter upper bound for the response
time of each subtask.

Example 2 illustrates the rules described above. In this example there
are four resources and three processors. Resource $R_1$ is assigned to
processor $P_1$; $R_2$ and $R_3$ to $P_2$; and $R_4$ to $P_3$. Task
$T_1$ is a periodic task. It has 10 segments, as shown by Figure 2. The
shaded segments denote that $T_1$ requires some resources during those
time intervals.

According to Step 1, $T_1$ is mapped into 6 subtasks, as shown by
Table 3. The segment from time 0 to time 6, denoted as (0,6], is mapped
onto one subtask $T_{1,1}$ because during this time interval, $T_1$
either does not require any resources or only requires local resources.
According to rule 3, we map it onto one subtask, and it runs on the
local processor, $P_1$. Similarly, segment (6,10] is mapped onto the
subtask $T_{1,2}$ because the accesses to $R_2$ and $R_3$ are nested
and only the outermost critical section becomes a subtask. This subtask
runs on processor $P_2$. Segments (16,19] and (19,22] are two different
subtasks, $T_{1,4}$ and $T_{1,5}$, because they access different remote
resources. They run on $P_2$ and $P_3$ respectively. The segments
(10,16] and (22,24] are mapped onto $T_{1,3}$ and $T_{1,6}$. They are
both on $P_1$.
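The merging effect of rule 3 can be sketched as below. The sketch assumes that nested critical sections have already been collapsed to their outermost section (rule 1) and that each segment is tagged with the processor it runs on (rule 2). The type `Segment`, the function `map_to_subtasks`, and the ten-segment list `EXAMPLE2` are hypothetical: the exact segment boundaries of Figure 2 are not recoverable here, so the list was chosen only so that the merged result matches the six subtask intervals stated in the text.

```c
#include <assert.h>
#include <stddef.h>

typedef struct { int start, end, proc; } Segment;  /* interval (start, end] on P_proc */

/* Hypothetical 10-segment breakdown of T_1, consistent with the text. */
static const Segment EXAMPLE2[10] = {
    { 0,  2, 1}, { 2,  4, 1}, { 4,  6, 1},   /* local work on P_1            */
    { 6,  8, 2}, { 8, 10, 2},                /* nested accesses on P_2       */
    {10, 13, 1}, {13, 16, 1},                /* local work on P_1            */
    {16, 19, 2}, {19, 22, 3},                /* remote accesses on P_2, P_3  */
    {22, 24, 1},                             /* local work on P_1            */
};

/* Merge consecutive segments that run on the same processor into one subtask.
   Returns the number of subtasks written to out (out must hold n entries). */
size_t map_to_subtasks(const Segment *seg, size_t n, Segment *out)
{
    size_t m = 0;
    for (size_t i = 0; i < n; i++) {
        if (m > 0 && out[m - 1].proc == seg[i].proc
                  && out[m - 1].end == seg[i].start)
            out[m - 1].end = seg[i].end;   /* extend the current subtask */
        else
            out[m++] = seg[i];             /* processor changed: new subtask */
    }
    return m;
}
```

Applied to `EXAMPLE2`, the merge yields six subtasks on $P_1$, $P_2$, $P_1$, $P_2$, $P_3$, $P_1$, so that no two consecutive subtasks share a processor.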


Table  3:  Example  2  -  Subtasks  Assignment 


Step  2  :  Assign  priorities  to  subtasks 

Several  methods  can  be  used  to  assign  priorities. 
Rate-monotonic  assignment  is  a  possible  choice.  Other 
choices  include  : 


R1  R2  R3  R3  R3  R4 


0  3  4  6  8  10  13  14  16  18  30  33  34 


<1191 

proc 

wm 

El 

1  Segments 

|iiai 

Pi 

1^1 

Figure  2:  Example  2  -  Task  Ti 


•  Globai-deadline-monotonic  assignment;  the  pri¬ 
ority  of  a  subtask  is  based  on  the  global  rela¬ 
tive  deadline,  Di,  the  deadline  of  the  task  T^;  the 
shorter  Di  is,  the  higher  priority  Tij  has. 

•  Effective-deadline-monotonic  assignment;  the 
priority  of  a  subtask  Tij  is  chosen  based  on  sub¬ 
task’s  effective  relative  deadline.  The  effective  rel¬ 
ative  deadline  EDij  of  Tij  in  a  task  Ti  with  n,- 
subtasks  is; 


$$ED_{i,j} = D_i - \sum_{k=j+1}^{n_i} \tau_{i,k}$$

where $\tau_{i,k}$ is the execution time of subtask Ti,k. Tij must complete at EDij units of time after Ti is released in order for Ti as a whole to complete in time.

Table 4 lists the priorities of the subtasks in Example 2, assigned based on their effective relative deadlines.

Table 4: Example 2 - Priority Assignment Based on Subtasks' Effective Deadlines

Step 3 : Determine the worst-case response times for subtasks

After Step 2 we have a set of subtasks on each processor, in which (1) every subtask requires either no resource or local resources and (2) every subtask has a fixed priority. Resource-access-control protocols for single-processor systems can be used to prevent deadlocks and uncontrolled priority inversion. Both PCP and SBP can be used in this case. Furthermore, we can obtain the worst-case blocking time 0ij for each subtask Tij. Consequently the worst-case response time Cij for each subtask can be computed according to the following equation. The derivation for this equation can be found in [5].

(1)

In this equation Hij is the set of subtasks that (1) are on the same processor as Tij, (2) are of different tasks than Ti, and (3) have priorities equal to or higher than Tij. Hij is a subset of Hij in which every subtask has a higher priority than Tij. Uij is the processor utilization factor of Tij. Again, 0ij is the maximum blocking time Tij can experience. For both PCP and SBP, 0ij can be approximated by MAX(Sk,i), where Sk,i is the maximum duration of critical sections for all possible Tk,i that (1) are on the same processor as Tij and (2) have lower priorities than Tij.

Step 4 : Check schedulability for the whole system

From the results obtained in the previous step, the worst-case response time for Ti can be obtained by summing up all response times of its subtasks:

$$C_i = \sum_j C_{i,j} \qquad (2)$$

If Ci > Di, where Di is the relative deadline of task Ti, we report failure for this task set. If all tasks pass this test, we report success.
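A minimal Python sketch of the two calculations this procedure relies on may help. It reads the effective relative deadline of a subtask as the task's relative deadline minus the execution times of all later subtasks, and it applies the Step 4 test by summing the subtask response times. The function names and the numbers in the usage note are illustrative assumptions, not values from the paper's examples.

```python
from typing import List


def effective_deadlines(deadline: float, exec_times: List[float]) -> List[float]:
    """Effective relative deadline of each subtask of one task.

    ED[j] = deadline - (sum of execution times of subtasks after j).
    """
    eds = []
    later_work = 0.0  # execution time of all subtasks after the current one
    for tau in reversed(exec_times):
        eds.append(deadline - later_work)
        later_work += tau
    eds.reverse()
    return eds


def task_is_schedulable(deadline: float, response_times: List[float]) -> bool:
    """Step 4: the task passes if the summed subtask response times fit the deadline."""
    return sum(response_times) <= deadline
```

For a hypothetical task with relative deadline 24 and subtask execution times 10, 6, 3, 3 and 2, `effective_deadlines(24, [10, 6, 3, 3, 2])` gives 10, 16, 19, 22 and 24: the last subtask inherits the task deadline, and each earlier subtask must leave room for the work that follows it.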


4  Conclusions 

In the previous section we presented a procedure for applying the end-to-end approach to scheduling tasks with shared resources in a multiprocessor system and analyzing the schedulability. In order to make this approach practical, some formulas need to be improved and problems which may arise in practice need to be addressed. For example, the upper bound for worst-case response time given by Eq. (1) sometimes is not satisfactory, especially for subtasks with low priorities. A method based on time-demand analysis has been developed to give a much tighter bound and will be presented in a future paper.

Another practical problem arises when we fix the subtasks' phases to enforce the execution precedence among them. In order to make the modified phases consistent and meaningful in a multiprocessor or distributed system, clocks on all processors have to be strictly synchronized, which can be difficult to achieve in practice. We can allow some clock drift among processors, provided that the drift is within a maximum limit of δ time units. An extra δ time units can be added to the worst-case response time for each subtask obtained in the previous section, and the execution precedence relations among subtasks will be safely enforced.

Another  solution  to  this  problem  is  to  use  dynamic 
phasing  for  subtasks  instead  of  static  phasing  used  in 
this  paper.  In  other  words,  a  subtask  can  be  triggered 
to  start  as  soon  as  its  previous  subtask  finishes.  We 
are  currently  working  on  the  schedulability  analysis 
for  such  systems. 

An  alternative  way  to  map  tasks  to  subtasks  is  to 
map  all  critical  sections,  both  for  local  resources  and 
for  remote  resources,  into  subtasks.  The  resultant  task 
system  has  end-to-end  processing  not  only  across  pro¬ 
cessors  but  also  within  each  processor.  A  study  in  [6] 
has  shown  that  schedulability  analysis  for  end-to-end 
processing  within  a  processor  is  possible  and  promis¬ 
ing.  We  are  currently  studying  the  schedulability  anal¬ 
ysis  for  such  systems. 

In  this  paper  we  assume  that  all  resources  accessed 
in  one  nested  critical  section  must  be  on  the  same 
processor.  This  assumption  in  general  can  be  overly 
restrictive.  We  will  address  this  problem  from  the 
point  of  view  of  both  resource  access  control  and 
task/resource  assignment.  Ideally  we  want  to  assign 
resources  to  processors  to  minimize  the  number  of 
nested  critical  sections  that  access  resources  on  more 
than  one  processor. 

In many ways, the end-to-end scheduling approach can be viewed as a divide-and-conquer approach: it divides the problem by mapping the given task set onto an end-to-end task set where each processor becomes relatively independent. It then resolves the local resource contention on each processor. Finally, it combines the results to obtain a global solution. This merit leads to a reduction in the complexity of the resource contention problem.

Abstract

It  has  been  recognised  that  future  hard  real-time 
systems  need  to  be  more  flexible  than  current 
scheduling  theory  permits.  One  method  of  increasing 
flexibility  is  the  incorporation,  at  run-time,  of  optional 
components  into  processes  with  hard  deadlines.  Such 
components  are  not  guaranteed  offline,  but  may  be 
guaranteed  at  run-time  if  sufficient  resources  are 
available.  This  is  achieved  by  providing  mechanisms 
within  the  kernel  for  run-time  monitoring  of  spare 
processor capacity and its subsequent assignment to
requesting  processes.  This  paper  examines  these 
mechanisms  within  the  context  of  fixed  priority  pre¬ 
emptive  scheduling. 

1.  Introduction 

The next generation of hard real-time systems needs to be
flexible,  adaptive  and  able  to  exhibit  intelligence.  This  is 
contrary  to  the  relatively  inflexible  and  static  approaches 
enforced  by  scheduling  theory  and  kernel  design  today: 
both  must  advance  before  such  flexibility  can  be  realised 
in  applications. 

One  method  of  achieving  improved  flexibility  is  the 
provision  of  optional  components,  not  afforded  offline 
guarantees,  but  executed  at  run-time  if  sufficient  resources 
are  available.  Classically,  such  components  are  embodied 
in  soft  real-time  processes,  executed  when  no  guaranteed 
(i.e.  hard)  process  is  runnable.  It  is  our  contention  that 
greater  flexibility  and  utility  is  obtained  by  permitting 
critical  processes  to  request  time  for  optional  components 
[4].  Such  components  may  be  bounded  and  need  to 
complete  once  started  or  be  guaranteed  a  minimum  time 
(unbounded  or  bounded).  Under  these  circumstances  it 
becomes  apparent  that  we  require  the  ability  to  guarantee, 
at  run-time,  execution  time  for  optional  components. 

The  focus  of  this  paper  is  upon  kernel  mechanisms  to 
provide  guaranteed  execution  time  for  optional 
components  of  hard  processes  within  fixed  priority  pre¬ 
emptive  systems. 

Assuming  that  in  safety-critical  systems  on-line 
guarantees  of  execution  time  for  optional  components 


cannot  be  provided  at  the  expense  of  reduced  predictability 
[4],  a  two-tier  view  of  scheduling  is  taken: 

•  initially,  offline  guarantees  are  given  to  hard 
processes,  providing  a  guaranteed  minimum  service; 

•  then,  on-line  guaranteed  execution  time  is  provided 
for  additional  components  at  run-time  without 
violating  the  guarantees  made  offline. 

The  first  tier  is  provided  by  fixed  priority  scheduling 
(rapidly  becoming  a  de  facto  standard  in  hard  real-time 
systems):  scheduling  theory  is  now  available  for  the 
sufficient  analysis  of  complex  process  sets  [2].  The  second 
tier  utilises  the  inherent  spare  processor  capacity  at  run¬ 
time  to  provide  additional,  guaranteed  execution  time,  to 
requesting  hard  processes.  Within  this  approach,  the 
following  issues  become  apparent: 

•  appropriate  programming  models  to  express 
additional  components; 

•  identification  of  spare  capacity  at  run-time; 

• assignment of spare capacity to requesting processes.

Many programming models have been proposed, e.g. the Imprecise Model of Liu et al [10] and the Unbounded Model of Audsley et al [3]. However, the discussion of such models is beyond the scope of this paper. The rest of this paper discusses kernel mechanisms to support the identification and assignment of spare capacity at run-time to requesting hard processes within fixed priority pre-emptive systems.

We assume that the kernel, as opposed to application processes, is responsible for the identification and assignment of spare capacity at run-time. This is in keeping with Rushby's view that a kernel for a safety-critical system should contain any mechanisms whose use by one application process may affect another application process which is crucial to the overall operation of the system [11]. In contrast, it is assumed that the demand for spare capacity is application process oriented. Amongst competing processes, a kernel policy decides, at run-time, to which process spare capacity is assigned.

The remainder of this introduction details our terminology. Section 2 describes mechanisms for on-line detection of spare capacity. Kernel level representation and management of detected spare capacity are discussed in section 3. Implementation issues are discussed in section 4, with section 5 offering our conclusions.

1.1 Terminology

Process $\tau_i$ has (unique) fixed priority $i$, worst-case execution time (WCET) $C_i$, deadline $D_i$, and is invoked at periods defined by $T_i$ (for sporadic processes $T_i$ is the minimum inter-arrival time). The set of processes with higher priorities than $\tau_i$ is given by $hp(i)$; the set of processes with equal or lower priority than $\tau_i$ is given by $lp(i)$. The set of higher priority levels than $i$ is given by $hpl(i)$, and the set of priority levels equal to or lower than $i$ is given by $lpl(i)$.

2.  Identification  Of  Spare  Capacity 

Given  the  need  to  provide  100%  deadline  predictability  for 
hard  processes,  it  is  inevitable  that  the  processor  and  other 
resources  will  be  under-utilised  at  run-time.  This  occurs 
for  many  reasons  [3],  including  pessimistic  WCET 
analysis,  hardware  speed-ups  (e.g.  cache,  pipeline)  etc. 
We  term  the  resources  not  required  at  run-time  as  spare 
capacity. Two forms of spare capacity may be identified:

Gain Time - processor time guaranteed to a crucial process offline but not required at run-time.

Slack Time - processor time not utilised by guaranteed executions at run-time.

Since gain time has been guaranteed offline as part of a
crucial  process's  WCET,  if  it  is  assigned  to  another 
process  at  run-time,  the  latter  inherits  the  guarantee 
afforded  to  the  original  process.  Section  2.1  discusses  a 
mechanism  for  the  detection  of  gain  time.  In  general,  a 
guarantee  cannot  be  afforded  to  a  process  assigned  slack 
time  (since  it  was  not  necessarily  guaranteed  to  a  process 
offline).  However,  mechanisms  exist  which  relax  this 
restriction.  One  example  is  discussed  in  section  2.2. 

2.1 Gain Points: A Mechanism for the Detection of Gain Time

Several approaches have been proposed which facilitate
the  detection  of  gain  time.  Haban  et  al  place  software 
triggers  at  the  end  of  basic  blocks  in  process  code  to 
measure  actual  execution  time  [8].  This  is  then  compared 
with  the  pre-determined  WCET  of  the  block  to  calculate 
gain  time.  In  a  similar  way,  Dix  et  al  allow  the  insertion 
of  milestones  into  process  code  [7].  When  the  milestone  is 
reached  the  maximum  remaining  execution  time  of  a 
process  is  communicated  to  the  kernel.  The  motivation  of 
the  milestone  is  to  declare  the  point  in  its  computation 
where  it  has  completed  its  major  computation  (i.e.  only 
has  housekeeping  functions  remaining),  although  it  could 
be used to detect gain time: milestones inserted at the end of basic blocks would detect gain time as in Haban's approach.

The  property  of  both  these  approaches  is  that  gain  time 
is  detected  after  it  has  been  generated,  at  a  relatively 
coarse  granularity.  In  general,  we  wish  to  be  aware  of 
spare  capacity  as  early  as  possible  (implying  a  finer 
detection  granularity  than  Haban's  or  Dix's  approach):  the 
sooner  it  can  be  determined,  the  sooner  it  can  be  usefully 
utilised. 

Consider the following program fragment:

IF  condition  THEN  16  units  ELSE  4  units  FI 

If  the  condition  holds,  the  WCET  of  the  statement  is  16 
units,  otherwise  4  units.  However,  WCET  analysis  will 
have  calculated  that  the  WCET  of  the  statement  is  16 
units.  Hence,  if  the  else  clause  is  executed,  12  units  of 
gain  time  will  become  apparent.  The  earliest  place  that 
this  gain  time  can  be  detected  is  after  the  condition  has 
been  determined,  prior  to  execution  of  the  else  clause. 
Here, a gain point is inserted into the code. This takes the form of a software trigger, informing the kernel of the gain point value via a system call (to the kernel). Where the gain point has a fixed value it is termed static.

This  approach  can  be  extended  for  other  control-flow 
language  constructs.  For  example,  in  languages  suitable 
for  hard  real-time  systems  a  maximum  loop  count  is 
declared  at  compilation  time.  When  the  actual  number  of 
iterations  is  known  (either  prior  to  loop  entry  or  on  loop 
exit)  a  dynamic  gain  point  can  be  declared  whose  value  is 
equal  to  the  number  of  iterations  not  required,  multiplied 
by  the  WCET  of  the  loop  body. 
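The two kinds of gain point can be sketched in a few lines of Python. This is only an illustration of the arithmetic described above; the function names are hypothetical and the kernel call that would report the value is not modelled.

```python
def static_gain(wcet_worst_branch: int, wcet_taken_branch: int) -> int:
    """Gain at a static gain point: WCET budgeted for the statement
    minus the WCET of the branch that is actually about to execute."""
    return wcet_worst_branch - wcet_taken_branch


def dynamic_loop_gain(max_iterations: int, actual_iterations: int, wcet_body: int) -> int:
    """Gain at a dynamic gain point for a bounded loop:
    iterations not required multiplied by the WCET of the loop body."""
    unused = max_iterations - actual_iterations
    if unused < 0:
        raise ValueError("actual iterations exceed the declared maximum")
    return unused * wcet_body
```

With the fragment above, taking the else clause gives `static_gain(16, 4)`, i.e. 12 units. A loop declared with at most 10 iterations that needs only 7, with a body WCET of 5 units (hypothetical numbers), gives `dynamic_loop_gain(10, 7, 5)`, i.e. 15 units.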

We  note  that  gain  time,  in  general,  is  detected  sooner 
using  gain  points  than  in  either  Haban  or  Dix's  approach. 
Further  uses  of  gain  points,  for  other  common  language 
structures,  are  given  in  [1].  The  gain  point  approach  is 
applicable  to  both  sporadic  and  periodic  processes. 

Gain  points  (i.e.  code  for  calculating  the  value  and 
making  the  kernel  call)  are  inserted  automatically  during 
compilation  or  WCET  analysis.  However,  it  remains 
possible  that  a  rogue  process  could  report  more  gain  time 
than  detected  (possibly  causing  failure  if  that  gain  time  is 
guaranteed  to  a  requesting  process).  To  detect  this,  when  a 
gain time call is made the kernel evaluates the condition:

$$C_i \ge g_i + A_i + c_i^{\mathrm{rem}} + v \qquad (1)$$

where $g_i$ is the gain time detected by $\tau_i$ so far in its current execution; $A_i$ is the actual execution time used so far by the current execution; $c_i^{\mathrm{rem}}$ is the WCET of the process code from the gain point to completion (calculated during WCET analysis and parameterised within loops); $v$ is the amount of gain time reported by the call. If condition (1) does not hold, the amount of gain time reported by the process is erroneous.

Note that $g_i + A_i$ is the WCET of the process up to the gain point and $c_i^{\mathrm{rem}} + v$ the remaining WCET after the gain point.

This approach can be extended to permit the effects of pipelines and other processor accelerating features to be monitored. Assuming that condition (1) holds, the amount of gain time due to these hardware features is given by:

$$e = C_i - (g_i + A_i + c_i^{\mathrm{rem}} + v)$$

Thus, the actual gain time recorded by the kernel is $e + v$.

Implementation  is  discussed  in  section  4. 
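As a rough sketch of the kernel-side check, the following Python function rejects a report that violates condition (1) and otherwise returns the gain time to record, including the hardware speed-up term. The class, field and function names are assumptions made for illustration; the paper does not define a kernel interface.

```python
from dataclasses import dataclass


@dataclass
class GainReport:
    wcet: float            # total WCET of the process
    gain_so_far: float     # gain time already detected in this execution
    executed: float        # actual execution time used so far
    wcet_remaining: float  # WCET of the code from the gain point to completion
    reported: float        # gain time reported by this call


def recorded_gain(r: GainReport) -> float:
    """Validate a gain time call and return the gain to record.

    Raises ValueError when the accounted time exceeds the WCET,
    i.e. when the process has reported more gain than is possible.
    """
    accounted = r.gain_so_far + r.executed + r.wcet_remaining + r.reported
    if r.wcet < accounted:
        raise ValueError("erroneous gain time report")
    hardware_gain = r.wcet - accounted  # speed-up from caches, pipelines, etc.
    return hardware_gain + r.reported
```

For a hypothetical process with a WCET of 40 units that has executed 10 units, has 14 units of worst-case code remaining and reports 12 units of gain, 4 units are unaccounted for, so the function returns 16. A report of 20 units in the same state would be rejected.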

2.2 The Approximate Slack Stealer: A Mechanism for the Detection of Slack Time

Assuming  that  gain  time  is  detected  on-line  via  gain 
points,  the  only  remaining  spare  capacity  is  slack  time, 
which  occurs  dynamically  at  run-time  if  the  utilisation  of 
the  system  is  less  than  100%.  Conventionally,  slack  time 
cannot  be  guaranteed  to  processes.  However,  recent 
research  has  shown  that  a  proportion  of  the  available  slack 
time  can  be  guaranteed  to  requesting  processes  at  run¬ 
time.  Essentially,  at  any  time,  the  amount  of  processor 
time  that  can  be  stolen  immediately  from  hard  processes  is 
calculated,  whilst  maintaining  the  offline  guarantees  given 
to  those  processes;  the  execution  of  hard  processes  is 
postponed,  allowing  other  non-guaranteed  executions  to 
occur.  The  amount  of  processor  time  that  can  be  stolen  is 
termed  the  slack. 

The Optimal Slack Stealer algorithm relies on a pre-computed table to define the slack present at each invocation of each hard process [9]. The method suffers two drawbacks: only periodic processes were considered; the size of the table is, in general, non-polynomial. These problems were addressed by the Dynamic Slack Stealer [5], which enables the amount of slack available to be calculated on-line: at time $t$, the amount of slack at priority level $i$ is equal to the amount of time not required by processes $hp(i) \cup \tau_i$ before the next deadline of $\tau_i$. The amount of slack that can be guaranteed to a process at priority level $i$ is the minimum slack at priority levels $lpl(i)$ [5]. This approach has non-polynomial complexity, which is inappropriate for use within the kernel.

Based  upon  the  Dynamic  Slack  Stealer,  a  class  of  on¬ 
line  Approximate  Slack  Stealing  algorithms  can  be 
derived.  These  have  been  shown  to  be  more  effective  than 
bandwidth  preserving  algorithms  [6]. 

At any point in time, Approximate Slack Stealing algorithms provide a lower bound on the slack time available (i.e. for a given interval, the slack detected is less than or equal to the exact amount of slack). We now provide the derivation of one such algorithm. The following are assumed available from the kernel at time $t$:

$x_i(t)$ - the earliest possible next release of $\tau_i$;

$d_i(t)$ - the next deadline of an invocation of $\tau_i$ (if the current invocation is complete then $d_i(t) = x_i(t) + D_i$);

$c_i(t)$ - the remaining (worst-case) execution time of the current invocation of $\tau_i$.

Note that $x_i(t)$ and $d_i(t)$ are measured relative to time $t$.

The exact amount of slack time available at priority level $i$ in $[t, t + d_i(t))$ whilst guaranteeing $\tau_i$ meets its deadline can be found by viewing the interval as comprising a number of level $i$ busy and idle periods (i.e. periods where processes of priority $i$ or higher are executing or not, respectively). Any level $i$ idle time in the interval can be swapped for computation without causing the deadline to be missed. A lower bound on this level $i$ idle time is found by obtaining an upper bound on the time processes $\tau_j \in hp(i) \cup \tau_i$ require in $[t, t + d_i(t))$. The maximum computation that $\tau_j$ performs in the interval is given by*:

$$I_j(t, d_i(t)) = c_j(t) + f_j(t, d_i(t))\, C_j + \min\left(C_j,\ \left(d_i(t) - x_j(t) - f_j(t, d_i(t))\, T_j\right)_0\right)$$

where $f_j(t, d_i(t))$ is the number of complete invocations of $\tau_j$ in $[t, t + d_i(t))$:

$$f_j(t, d_i(t)) = \left\lfloor \frac{d_i(t) - x_j(t)}{T_j} \right\rfloor_0$$

Thus, $I_j(t, d_i(t))$ comprises three components: $\tau_j$ computation outstanding at $t$; $f_j(t, d_i(t))$ complete invocations of $\tau_j$; and a partially complete final invocation. A lower bound on the level $i$ slack at time $t$, $S_i(t)$, is given by:

$$S_i(t) = \left( d_i(t) - \sum_{\forall j \in hp(i) \cup i} I_j(t, d_i(t)) \right)_0$$

To enable the assignment of available slack time to requesting processes, we must ensure that all hard processes still meet their deadlines. Hence, the maximum amount of slack that can be guaranteed at priority level $i$ (i.e. assigned to a requesting process) is given by:

$$S_i^{\max}(t) = \min_{\forall j \in lpl(i)} S_j(t)$$


The approach is applicable to periodic and sporadic processes. In general, the complexity of the approach is $O(n)$ for determining the slack at one priority level and $O(n^2+n)$ for determining the slack at all levels. It is noted that the latter is bounded, and therefore appropriate for use by the kernel. Approximate Slack Stealing algorithms have been explored further by Davis [6]. Implementation issues are discussed in section 4.
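The derivation above can be sketched in Python as follows. This is an illustrative reading, not the authors' kernel code: it assumes processes are stored in a list ordered from highest to lowest priority, and it takes the number of complete invocations in the window to be the floor of (window minus next release) divided by the period, clipped at zero. All names are hypothetical.

```python
import math
from dataclasses import dataclass
from typing import List


@dataclass
class Proc:
    C: float  # worst-case execution time
    T: float  # period or minimum inter-arrival time
    x: float  # earliest next release, relative to the current time
    d: float  # next deadline, relative to the current time
    c: float  # remaining worst-case execution time of the current invocation


def clip0(v: float) -> float:
    """The (x)_0 operator: max(x, 0)."""
    return max(v, 0)


def interference(p: Proc, window: float) -> float:
    """Upper bound on the computation p performs in a window starting now."""
    complete = clip0(math.floor((window - p.x) / p.T))
    partial = min(p.C, clip0(window - p.x - complete * p.T))
    return p.c + complete * p.C + partial


def level_slack(procs: List[Proc], i: int) -> float:
    """Lower bound on the slack at priority level i (index 0 is highest priority)."""
    window = procs[i].d
    demand = sum(interference(p, window) for p in procs[: i + 1])
    return clip0(window - demand)


def guaranteed_slack(procs: List[Proc], i: int) -> float:
    """Slack that can be guaranteed at level i: the minimum over level i and all lower levels."""
    return min(level_slack(procs, j) for j in range(i, len(procs)))
```

Each call to `level_slack` visits at most n processes, and `guaranteed_slack` may evaluate it at every level, which matches the linear and quadratic costs quoted above.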


* $(x)_0$ represents $\max(x, 0)$: its minimum value is thus 0.

3. Management Of Spare Capacity

In this section we show how the spare capacity detected by the two mechanisms can be integrated into a joint framework to allow efficient assignment to requesting processes.

3.1 Representation of Spare Capacity

We assume that the values $x_i(t)$, $d_i(t)$, $A_i$, $S_i(t)$ and $g_i$ are held by the kernel, together with $g_i^*$, the amount of gain time assigned from priority level $i$ during the current execution of $\tau_i$. Thus, the storage requirements are $O(n)$ in the number of processes.

3.2 Management of Spare Capacity Tuples

Since  gain  time  has  been  guaranteed  at  a  particular 
priority  level  by  offline  feasibility  analysis,  in  general,  it 
must  be  utilised  in  preference  to  the  normal  execution  of  a 
lower  priority  process.  Otherwise,  one  of  the  fundamental 
assumptions  of  fixed  priority  feasibility  analysis,  that 
processes  execute  as  soon  as  possible,  is  violated  with  the 
possibility that guaranteed deadlines may be missed. Also, values of $S_i(t)$ must be updated as time progresses: if $\tau_i$ executes, the amount of slack available at $hpl(i)$ must be decreased by the amount of execution by $\tau_i$.

The  underlying  management  issue  is  of  correctly 
ageing  spare  capacity.  This  can  be  achieved  by  modifying 
the  fundamental  principle  contained  in  the  Priority 
Exchange algorithm [12]: if gain time exists at priority level $i$ when $\tau_i$ completes, the gain time at priority level $i$ is moved to priority level $i-1$.

Whilst  updating  of  available  gain  time  and  slack  time 
could  occur  every  time  unit,  the  overhead  would  become 
large  (without  hardware  support).  Alternatively,  updates 
could  occur  whenever  a  demand  for  spare  capacity  is 
made.  However,  this  would  require  that  the  kernel  record 
the  executing  processes  and  the  amount  of  time  they 
executed  since  the  last  update.  This  is  complex  since, 
potentially,  many  invocations  of  the  guaranteed  processes 
may  occur.  The  approach  adopted  in  this  paper  is  to 
update  the  spare  capacity  at  a  context  switch  or  request  for 
spare  capacity  (other  approaches  are  given  in  [1]). 

At an update, let the executing process be $\tau_i$, and let the elapsed time since the previous update be $E$ ($\tau_i$ will be the only executing process in this interval). To update the available spare capacity, the kernel must:

(1) decrease the available slack at $hpl(i)$ by $E$;

(2) if $\tau_i$ has completed, increase the slack at levels $lpl(i)$ by $g_i - g_i^*$ and set $g_i = 0$ and $g_i^* = 0$.

Note, if the processor was idle in the interval since the last update, the slack at priority levels $lpl(1)$ is decreased by $E$.

By  using  the  above  rules,  the  guarantees  associated 
with  gain  time  and  slack  time  are  maintained,  without 
causing  deadline  failure  of  other  processes. 
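The update rules can be sketched as a small Python class. This is a sketch under stated assumptions: priority levels are list indices with index 0 as the highest priority, an idle processor is signalled by passing `None`, and the class and method names are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class SpareCapacity:
    slack: List[float]          # slack held per priority level
    gain: List[float]           # gain time detected per level in the current execution
    gain_assigned: List[float]  # gain time already assigned from each level

    def update(self, running: Optional[int], elapsed: float, completed: bool = False) -> None:
        """Age the spare capacity at a context switch or spare capacity request."""
        levels = len(self.slack)
        if running is None:
            # Idle processor: slack at every level is consumed.
            for k in range(levels):
                self.slack[k] -= elapsed
            return
        # Rule (1): the running process delays all higher priority levels.
        for k in range(running):
            self.slack[k] -= elapsed
        # Rule (2): on completion, unassigned gain time becomes slack
        # at this level and all lower levels, and the counters reset.
        if completed:
            unused = self.gain[running] - self.gain_assigned[running]
            for k in range(running, levels):
                self.slack[k] += unused
            self.gain[running] = 0
            self.gain_assigned[running] = 0
```

Each update touches every level at most once, which is consistent with a bounded, linear increase in context switch time.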


Note that when the Approximate Slack Stealing algorithm is executed, the value of $c_i(t)$ is required. The management approach adopted above implies that, since gain time is held separately to slack time, the value of $c_i(t)$ used when calculating slack is given by:

$$c_i(t) = C_i - A_i$$

This management approach increases context switch time by a bounded amount, $O(n)$ in the number of hard processes.

3.3 Assigning Spare Capacity

Given the management of spare capacity as described above, the assignment of spare capacity to requesting processes becomes relatively straightforward. Consider hard process $\tau_i$ requesting additional computation time at time $t$ before its deadline at $t + d_i(t)$. A number of sources of guaranteed spare capacity may be checked, including:

(1) check gain time at priority level $i$ (if $g_i > g_i^*$ then gain time is available for assignment);

(2) check gain time at priority levels $hpl(i)$;

(3) check gain time at priority levels $j \in lpl(i)$ where $t + d_j(t) < t + d_i(t)$;

(4) check slack time at priority levels $lpl(i)$ (i.e. evaluate $S_i^{\max}(t)$);

(5) if the time at which the slack at priority levels $lpl(i)$ was last calculated is prior to $t$, we may check for the possibility of additional slack by recalculating the slack for priority levels $lpl(i)$, then repeating (4).

Clearly,  more  than  one  of  the  above  can  be  used  in  the 
provision  of  spare  capacity  to  a  single  request. 

The complexity of (1) is $O(1)$. In general, the complexity of (2), (3) and (4) is $O(n)$, with (5) being $O(n^2+n)$. A fast decision can be made using (1), with greater cost incurred by the other approaches, although a greater amount of gain time may be found with an associated increased chance of being able to honour the request for spare capacity.

An $O(n)$ approach for guaranteeing an amount of computation before an arbitrary deadline (i.e. not necessarily the deadline of the requesting process) is given in [14].

After spare capacity has been assigned, adjustments to the available spare capacity must be made. If gain time at priority level $i$ is assigned, $g_i^*$ is increased; if slack time at priority level $i$ is assigned, the available slack at $lpl(i)$ is decreased.
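A minimal Python sketch of an assignment policy that uses only checks (1) and (4), followed by the adjustments just described. The function name, the return values and the choice to try gain time before slack time are assumptions for illustration; index 0 is taken to be the highest priority level.

```python
from typing import List


def assign_spare(i: int, amount: float, gain: List[float],
                 gain_assigned: List[float], slack: List[float]) -> str:
    """Try to guarantee `amount` of extra time to the process at level i.

    Returns "gain", "slack" or "rejected" and updates the lists in place.
    """
    # Check (1): unassigned gain time at the requester's own level.
    if gain[i] - gain_assigned[i] >= amount:
        gain_assigned[i] += amount
        return "gain"
    # Check (4): slack guaranteed at level i is the minimum over
    # level i and all lower priority levels.
    lower = range(i, len(slack))
    if min(slack[k] for k in lower) >= amount:
        for k in lower:
            slack[k] -= amount
        return "slack"
    return "rejected"
```

Checks (2), (3) and (5) could be added between these steps at the higher cost noted above, raising the chance that a request is honoured.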

4.  Implementation  Issues 

It has been argued that the requirement for small and bounded kernel overheads precludes the use of complex on-line scheduling techniques [13]. It is the authors' contention that providing increased flexibility via re-use of spare capacity requires relatively complex, but bounded, on-line kernel algorithms to be employed. The reliability of the kernel should not be affected if it is constructed using the same engineering procedures as employed during the application software life-cycle. The fundamental issue is not one of on-line complexity, but of being able to guarantee the feasibility of processes offline. Hence, the use of on-line scheduling must minimise any effects it has upon process feasibility.

The  number  of  gain  points  inserted  into  process  code 
can  be  high.  Ideally,  their  insertion  should  not  affect  the 
feasibility  of  a  process  by  increasing  its  WCET.  This  is 
not  always  possible  unless  some  gain  points  are  omitted. 
However,  overheads  can  be  reduced  by  only  inserting  gain 
points  whose  value  will  be  at  least  the  cost  of  the  gain 
time  kernel  call.  In  this  case,  some  gain  time  will  be 
detected  late,  although  will  show  up  as  an  efficiency  speed 
up  next  time  a  gain  point  kernel  call  is  made.  Now, 
assuming  that  all  processes  are  feasible,  any  gain  points 
not  originally  inserted  could  now  be  placed  into  process 
code  whilst  the  system  remains  feasible.  Alternatively, 
gain  time  could  be  reported  at  a  context  switch  or  request 
for  spare  capacity  (i.e.  at  an  update  -  see  section  3.2)  [1]. 

The  execution  of  the  Approximate  Slack  Stealing 
algorithm  need  only  occur  when  insufficient  spare 
capacity  is  available  for  outstanding  requests.  Then,  the 
algorithm  is  only  executed  if  sufficient  spare  capacity 
exists  to  execute  the  algorithm  itself.  Thus,  the  feasibility 
of  processes  is  not  affected  by  executions  of  the  algorithm. 
We  note  that  the  algorithm  can  also  execute  instead  of 
idling  the  processor. 

Whilst  the  above  reduces  the  effects  of  spare  capacity 
detection  and  assignment  upon  the  feasibility  of  processes, 
some  other  overheads  must  be  taken  into  account,  e.g. 
spare  capacity  updates.  The  approach  of  section  3.3 
implies  that  additional  information  must  be  collected  and 
stored  at  run-time.  Also,  the  manipulation  of  the  run 
queue  may  become  more  complex  when  a  process  is 
assigned  spare  capacity  at  a  different  priority  level  to  its 
own;  the  effects  of  this  can  be  reduced  by,  for  example, 
restricting  a  process  to  spare  capacity  at  one  priority  level 
other  than  its  own. 

5.  Conclusions 

This  paper  has  illustrated  how  optional  components,  not 
guaranteed  offline,  can  be  guaranteed  computation  time  at 
run-time  (if  sufficient  spare  capacity  is  available).  Spare 
capacity  is  detected  by  the  Gain  Point  mechanism  and  the 
Approximate  Slack  Stealing  algorithm.  The  management 
of  detected  spare  capacity  has  been  described,  enabling  its 
efficient  assignment  to  requesting  processes  at  run-time. 
The approach described can be extended for more flexible process characteristics [1,5,6], e.g. resource sharing (i.e. process blocking), precedence constrained processes, and process release jitter.

Currently,  the  mechanisms  described  within  this  paper 
are  being  incorporated  into  the  DrTEE  hard  real-time 
kernel.  The  initial  results  indicate  that  the  feasibility  of 
processes  is  not  unduly  affected  by  using  these 
mechanisms,  with  the  ability  to  re-use  spare  capacity  at 
run-time  adding  to  the  amount  of  computation  time 
available to hard processes at run-time, so improving the utility and flexibility of the resultant system.

Abstract

A new online task assignment scheme is presented for multiprocessor systems where individual processors execute the rate-monotonic scheduling algorithm. The computational complexity of the task assignment scheme grows linearly with the number of tasks, and its performance is shown to be significantly better than previously existing schemes. The superiority of the assignment scheme is achieved by a new schedulability condition derived for the rate-monotonic scheduling discipline.


1  Introduction 

Rate-monotonic  (RM)  scheduling  is  becoming  a 
viable  scheduling  discipline  for  real-time  systems. 
Through  the  years,  researchers  have  successfully  ap¬ 
plied  this  discipline  to  tackle  a  number  of  practical 
problems,  such  as  task  synchronization,  bus  schedul¬ 
ing,  joint  scheduling  of  periodic  and  aperiodic  tasks, 
and  transient  overload  [4,  9].  This  is  done  through 
developing  various  scheduling  algorithms  to  cope  with 
situations  that  are  not  covered  by  the  rate-monotonic 
algorithm. 

While  rate-monotonic  scheduling  is  optimal  for 
uniprocessor  systems  with  fixed-priority  assignments, 
it  is,  unfortunately,  not  so  for  multiprocessor  systems. 
In  fact,  the  problem  of  optimally  scheduling  a  set  of 
periodic  tasks  on  a  multiprocessor  system  using  ei¬ 
ther  fixed-priority  or  dynamic  priority  assignments  is 
known  to  be  intractable  [6].  Hence,  any  practical  so¬ 
lution  to  the  problem  of  scheduling  real-time  tasks  on 
multiprocessor  systems  presents  a  trade-off  between 
computational  complexity  and  performance.  Heuris¬ 
tic  algorithms  have  been  shown  to  deliver  near-optimal 
solutions  with  limited  computational  overhead. 


In  this  study,  we  are  concerned  with  developing  an 
efficient  heuristic  algorithm  for  scheduling  a  set  of  pe¬ 
riodic  tasks  on  a  multiprocessor  system.  The  general 
solution  to  such  a  problem  involves  two  algorithms: 
one  to  schedule  tasks  assigned  on  each  individual  pro¬ 
cessor,  and  the  other  to  assign  tasks  to  the  processors. 
In  the  following,  we  only  consider  multiprocessor  sys¬ 
tems  where  each  processor  executes  the  RM  scheduling 
algorithm. 

For  the  assignment  of  tasks  to  processors,  one  dis¬ 
tinguishes  offline  and  online  algorithms.  If  the  entire 
task  set  is  known  a  priori,  the  scheduling  method  is 
referred  to  as  being  offline,  otherwise  it  is  said  to  be 
online. The task assignment scheme presented here
belongs to the class of online algorithms.

Since  real-time  systems  often  operate  in  dynamic 
and  complex  environments,  many  scheduling  decisions 
must  be  made  online.  For  example,  a  change  of  mis¬ 
sion  may  require  the  execution  of  a  totally  different 
task  set.  Or  the  failure  of  some  processors  may  render 
a  re-assignment  of  tasks  necessary.  In  these  scenar¬ 
ios,  the  entire  task  set  to  be  scheduled  may  change 
dynamically,  that  is,  tasks  must  be  added  or  deleted 
from  the  task  set. 

Previous work on this problem illustrates the trade-off
between computational complexity and performance
of heuristic task assignment schemes. The complexity
of an algorithm is given by the upper bound
of the time required to schedule a set of $K$ tasks. The
performance of task assignment schemes is evaluated
by providing worst case bounds for $N/N_{opt}$, where
$N$ is the number of processors required to schedule
a task set with a given heuristic method, and $N_{opt}$ is
the number of processors needed by an optimal assignment.
Bounds for the existing schemes are determined
by $\lim_{N_{opt} \to \infty} N/N_{opt}$.

Davari and Dhall presented an online task assignment
algorithm with a computational complexity
of $O(K)$ and a performance bound of
$N/N_{opt} = 2.28$ [2]. Oh and Son developed
two scheduling algorithms in [8]. The algorithms
have a time complexity of $O(K \log K)$, and worst case
performance bounds of $\lim_{N_{opt} \to \infty} N/N_{opt} = 2.33$ and
2.66, respectively. In both studies, the authors apply
variants of well-known heuristic bin-packing algorithms
where the set of processors is regarded as a set
of bins.* The decision whether a processor is full
is determined by a schedulability condition. All assignment
schemes in [2, 8] are based on the sufficient
schedulability condition for uniprocessor systems derived
in [7] and its variants, e.g., [3]. Thus, these assignment
schemes differ mainly in the choice of the
bin-packing heuristic.

Our approach for developing a task assignment
scheme for multiprocessor systems is different from
previous work. Rather than increasing the level of
sophistication of the bin-packing heuristic, we have
developed a tighter schedulability condition that allows
us to assign more tasks to each processor [1]. If the
periods of tasks are sufficiently close, we can show that
each processor can be almost fully utilized. Based on
the new schedulability condition we present a novel
online scheme for assigning tasks to a multiprocessor
system. The complexity of our assignment scheme is
$O(K)$ and the worst case performance bound
is shown to be $\lim_{N_{opt} \to \infty} N/N_{opt} = 1/(1 - \alpha)$, where
$\alpha$ is an upper bound for the load factor of any single
task.


2  Task Model and Schedulability Condition

We assume that the real-time computer system consists
of a homogeneous multiprocessor system and a
set of $K$ real-time tasks. The multiprocessor and the
task set are characterized as follows.

A real-time task is denoted by $\tau_i = (C_i, T_i)$
($i = 1, \ldots, K$). $T_i$ denotes the shortest time between two
requests of task $\tau_i$, and is referred to as the period of
$\tau_i$. $C_i$ denotes the maximum execution time of task
$\tau_i$. Since we assume that the multiprocessor system
is homogeneous, the execution time of $\tau_i$ is identical
on each processor. Each request for a real-time task
must complete execution before the next request of the
same task. Thus, in the worst case, the execution of
$\tau_i$ must be completed within $T_i$ time units of its
request. The period and the maximum execution time
of task $\tau_i$ satisfy

$$T_i > 0, \quad 0 < C_i \le T_i, \quad i = 1, \ldots, K$$

*The bin-packing problem is concerned with packing
different-sized items into fixed-sized bins using the least number
of bins [5].

We will refer to $U_i = C_i/T_i$ as the load factor of the
$i$-th task, and to $U = \sum_{i=1}^{K} U_i$ as the total load of the
task set. We define $\alpha$ to be an upper bound for the
load factor of any task, i.e., $\alpha \ge \max_{1 \le i \le K} U_i$. $P_n$
denotes the utilization of the $n$-th processor, that is,
the sum of the load factors of the tasks assigned to
processor $n$. Tasks are grouped into $M$ classes, and
only tasks from the same class can be assigned to the
same processor.
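The task model above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names `Task`, `total_load` and `alpha` are hypothetical and do not come from the paper, and the constraint on the execution time is taken to be $0 < C \le T$.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    """A periodic task (C, T); the class name is illustrative only."""
    c: float  # maximum execution time C_i
    t: float  # period T_i (shortest time between two requests)

    def __post_init__(self):
        # Assumed constraint of the task model: T > 0 and 0 < C <= T.
        if not (self.t > 0 and 0 < self.c <= self.t):
            raise ValueError("require T > 0 and 0 < C <= T")

    @property
    def load(self) -> float:
        """Load factor U_i = C_i / T_i."""
        return self.c / self.t


def total_load(tasks):
    """Total load U: the sum of all load factors."""
    return sum(task.load for task in tasks)


def alpha(tasks):
    """Smallest admissible alpha: the largest load factor in the set."""
    return max(task.load for task in tasks)


tasks = [Task(1, 4), Task(2, 8), Task(1, 2)]
print(total_load(tasks), alpha(tasks))
```

Any value at least as large as the result of `alpha(tasks)` is a valid upper bound in the sense used by the paper.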

Next  we  present  a  new  sufficient  schedulability  con¬ 
dition  for  a  processor  that  schedules  tasks  with  the 
RM  algorithm.  The  result,  presented  in  Theorem  1, 
is  a  simple  modification  to  the  schedulability  condi¬ 
tion  for  uniprocessor  systems  by  Liu  and  Layland  [7]. 
Our  condition  yields  a  higher  utilization  of  the  proces¬ 
sor  if  the  task  periods  satisfy  certain  constraints.  On 
a  uniprocessor  system,  Theorem  1  does  not  provide  a 
significant  improvement  for  scheduling  real-time  tasks. 
For  multiprocessor  scheduling,  however,  we  can  divide 
a  large  task  set  into  subsets  in  such  a  way  that  we 
can  make  use  of  the  sharpened  condition  on  all  but 
possibly  M  processors. 

The schedulability condition presented in the following
theorem takes advantage of a special property
of the RM scheduling algorithm. We show that we
can increase the processor utilization if all periods in
a task set have values that are close to each other. The
proof of the theorem can be found in [1].

Theorem 1 Given a real-time task set $\tau_1, \ldots, \tau_K$.
For $i = 1, \ldots, K$, define

$$S_i := \log_2 T_i - \lfloor \log_2 T_i \rfloor \quad \text{and} \qquad (1)$$

$$\beta := \max_{1 \le i \le K} S_i - \min_{1 \le i \le K} S_i \qquad (2)$$

A task set with $\beta < 1 - 1/K$ can be feasibly scheduled
by the Rate-Monotonic algorithm if the total load
satisfies (3). The condition is tight.

$$U \le (K-1)\left(2^{\beta/(K-1)} - 1\right) + 2^{1-\beta} - 1 \qquad (3)$$

Note that the condition given by (3) is tighter than the
one given by Liu and Layland [7] under $\beta < 1 - 1/K$.

Corollary 1 Given a set of real-time tasks $\tau_1, \ldots, \tau_K$.
If the total load satisfies $U \le \max\{\ln 2, 1 - \beta \ln 2\}$,
then the task set can be scheduled on one processor,
where $\beta$ is as defined above in (2).
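To see how much the period-aware condition can gain over the Liu and Layland bound, the following Python sketch evaluates both for a given list of periods. It is a sketch under assumptions: the period spread of (2) is written `beta` here, the non-strict form of (3) is assumed, and the function names are hypothetical. When the premise of Theorem 1 does not hold, the sketch falls back to the Liu and Layland bound $K(2^{1/K} - 1)$.

```python
import math


def beta(periods):
    """Spread of the fractional parts of log2(T_i), as in (1) and (2)."""
    s = [math.log2(t) - math.floor(math.log2(t)) for t in periods]
    return max(s) - min(s)


def liu_layland_bound(k):
    """Classical utilization bound K(2^(1/K) - 1) for K tasks."""
    return k * (2 ** (1 / k) - 1)


def period_aware_bound(periods):
    """Right hand side of (3); falls back to Liu and Layland otherwise."""
    k = len(periods)
    b = beta(periods)
    if k < 2 or b >= 1 - 1 / k:
        return liu_layland_bound(k)
    return (k - 1) * (2 ** (b / (k - 1)) - 1) + 2 ** (1 - b) - 1


def corollary_bound(periods):
    """K-independent bound of Corollary 1: max{ln 2, 1 - beta ln 2}."""
    return max(math.log(2), 1 - beta(periods) * math.log(2))


close_periods = [16, 17, 18, 19]
print(liu_layland_bound(4), period_aware_bound(close_periods))
```

For periods that differ only by powers of two, such as 10, 20 and 40, the spread is zero and both `period_aware_bound` and `corollary_bound` approach 1, which matches the remark that a processor can be almost fully utilized when periods are close in this sense.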
3  An Online Task Assignment Scheme

Our new scheme is based on the schedulability condition
of Theorem 1. The parameter used for the
scheme, $M$, denotes the number of processors to which
a new task can be assigned. Recall that tasks are divided
into $M$ classes. The class membership of a task
$\tau$ with period $T$ is determined by the following expression:

$$m = \lfloor M (\log_2 T - \lfloor \log_2 T \rfloor) \rfloor + 1 \qquad (4)$$

Each processor is assigned tasks from only one class.
Thus, at each processor the value of $\beta$ as defined in
(2) is bounded above by $1/M$. For each class, the
scheme keeps one so-called current processor. If a new
task from class $m$ is added to the task set, the scheme
first attempts to accommodate the task on the current
processor for class $m$. A complete description of the
algorithm for assigning a task $\tau = (C, T)$ is given in
Algorithm 1.

In Algorithm 1, adding a new task $\tau = (C, T)$ is
accomplished in the following manner. First, the class
membership of the new task $\tau$ is determined (Step 1).
If $\tau$ can be added to the current processor of class $m$
without violating the schedulability condition, it is
assigned to this processor. Otherwise, $\tau$ is assigned to an
empty processor. If the load factor of $\tau$ is sufficiently
small (Step 4), the processor to which $\tau$ is assigned
becomes the current processor of class $m$ (Step 5). If the
load factor of $\tau$ is large, no other task will be assigned
to this processor (Step 7).


Global functions:

curr(m)   : Returns current processor for class m.
newproc() : Returns index of an empty processor.

Add($\tau = (C, T)$)

1.  $m := \lfloor M(\log_2(T) - \lfloor \log_2(T) \rfloor) \rfloor + 1$;
2.  if ($P_{curr(m)} + C/T \le 1 - \ln 2/M$) then
3.      $P_{curr(m)} := P_{curr(m)} + C/T$;
4.  else if ($C/T < P_{curr(m)}$) then
5.      curr(m) := newproc(); $P_{curr(m)} := C/T$;
6.  else
7.      x := newproc(); $P_x := C/T$;
8.  endif


Algorithm 1. Online Task Assignment.
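The steps of Algorithm 1 can be sketched in Python as follows. This is a minimal sketch and not the authors' implementation: the class `OnlineAssigner` and its attributes are hypothetical names, the test in Step 2 is assumed to be non-strict, and the current processor of a class is allocated lazily (a class that has not yet received a task is treated as having a current processor with load 0).

```python
import math


class OnlineAssigner:
    """Illustrative sketch of Algorithm 1 (names are assumptions)."""

    def __init__(self, m_classes):
        self.m = m_classes
        self.threshold = 1 - math.log(2) / m_classes  # 1 - ln2/M
        self.load = []     # load[n] is the utilization P_n of processor n
        self.current = {}  # class m -> index of its current processor

    def _new_processor(self):
        self.load.append(0.0)
        return len(self.load) - 1

    def task_class(self, period):
        """Step 1: class membership according to (4)."""
        frac = math.log2(period) - math.floor(math.log2(period))
        # The min() only guards against floating-point rounding.
        return min(int(self.m * frac), self.m - 1) + 1

    def add(self, c, t):
        """Assign task (C, T); return the index of the chosen processor."""
        u = c / t
        m = self.task_class(t)
        cur = self.current.get(m)
        cur_load = self.load[cur] if cur is not None else 0.0
        if cur_load + u <= self.threshold:          # Steps 2-3
            if cur is None:
                cur = self.current[m] = self._new_processor()
            target = cur
        elif u < cur_load:                          # Steps 4-5
            target = self.current[m] = self._new_processor()
        else:                                       # Steps 6-7
            target = self._new_processor()
        self.load[target] += u
        return target


assigner = OnlineAssigner(m_classes=2)
placement = [assigner.add(c, t) for c, t in
             [(1, 4), (1, 4), (1, 4), (3, 4), (1, 6)]]
print(placement, len(assigner.load))
```

Each call to `add` does a constant amount of work, which is consistent with the linear overall complexity claimed for the scheme.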

The  performance  bounds  of  our  scheme  are  given 
in  Theorem  2  and  Corollary  2.  Corollary  2  states  the 
asymptotic  bound. 


Theorem 2 If a task set is scheduled by Algorithm 1,
then the number of processors needed satisfies (5) and
(6). Inequality (5) is tight if $\alpha \le (1 - \ln 2/M)/2$, and
inequality (6) is tight if $\alpha \ge (1 - \ln 2/M)/2$.

$$N < \frac{U}{1 - \ln 2/M - \alpha} + M \qquad (5)$$

$$N < \frac{2U}{1 - \ln 2/M} + M \qquad (6)$$

Proof. The schedulability condition used in Step 2 of
Algorithm 1 enforces that at any instant, the load on
all processors but the $M$ current processors exceeds
both $1 - \ln 2/M - \alpha$ and $(1 - \ln 2/M)/2$. □

Corollary 2 Let $\{\tau_i \mid i = 1, 2, \ldots\}$ be a given infinite
task set. Denote by $U(k)$ the sum of the load factors
of the first $k$ tasks. Denote by $N_{A1}(k)$ the number
of processors used by Algorithm 1, and by $N_{opt}(k)$
the number of processors used by an optimal scheme.
If $\lim_{k \to \infty} U(k) = \infty$ then we obtain the asymptotic
bounds (7) and (8). The bounds are tight.

$$\lim_{k \to \infty} \frac{U(k)}{N_{A1}(k)} \ge \max\left\{ 1 - \ln 2/M - \alpha,\ \frac{1 - \ln 2/M}{2} \right\} \qquad (7)$$

$$\lim_{k \to \infty} \frac{N_{A1}(k)}{N_{opt}(k)} \le \min\left\{ \frac{1}{1 - \ln 2/M - \alpha},\ \frac{2}{1 - \ln 2/M} \right\} \qquad (8)$$

Proof. We obtain both (7) and (8) by passing to the
limit in (5) and (6). □

From the derived bounds we see that the performance
of Algorithm 1 is sensitive to the selection of $M$, the
number of task classes. The asymptotic bounds in
(7) and (8) improve for large values of $M$. However,
$M$ also determines the number of current processors,
i.e., processors which are not fully utilized. Next we
present a method for selecting an appropriate value of
$M$ for use in inequalities (5) and (6).

Assume that the total load of the task set is known.
To find the value of $M$ that gives the best worst-case
bound for the number of processors in (6), we fix the
value of $U$ in (6). Since the right hand side of (6) is
a strictly convex function of $M$, we can calculate the
unique minimum which is denoted by $M^*$:

$$M^* = \sqrt{2U \ln 2} + \ln 2 \qquad (9)$$
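The minimizer in (9) can be checked with elementary calculus. The derivation below is a reader's sketch, not reproduced from the paper; $f$ is a name introduced here for the right hand side of (6) viewed as a function of $M$.

```latex
\begin{align*}
f(M) &= \frac{2U}{1 - \ln 2/M} + M = \frac{2UM}{M - \ln 2} + M,
        \qquad M > \ln 2, \\
f'(M) &= 1 - \frac{2U \ln 2}{(M - \ln 2)^2}, \qquad
f''(M) = \frac{4U \ln 2}{(M - \ln 2)^3} > 0, \\
f'(M^*) = 0 &\iff (M^* - \ln 2)^2 = 2U \ln 2
            \iff M^* = \sqrt{2U \ln 2} + \ln 2 .
\end{align*}
```

The positive second derivative confirms strict convexity on $M > \ln 2$, so the stationary point is the unique minimum.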

This suggests that we should choose $M \sim \sqrt{U}$. Then
we obtain

$$U/N > \frac{1 - \ln 2/M}{2}\,(1 - M/N) = 1/2 - O(1/\sqrt{U}) \qquad (10)$$

and hence

$$N/N_{opt} < 2 + O(1/\sqrt{U}) \qquad (11)$$

Similarly, we can minimize the right hand side of (5)
over $M$ and obtain that the optimal choice for $M$
should be as close as possible to

$$M^* = \frac{\sqrt{U \ln 2} + \ln 2}{1 - \alpha} \qquad (12)$$

If we choose $M \sim \sqrt{U}$, we obtain with (5) the following
bound for the average utilization at each processor

$$U/N > (1 - \ln 2/M - \alpha)(1 - M/N) = 1 - \alpha - O(1/\sqrt{U}) \qquad (13)$$

and $N/N_{opt}$ is given by

$$N/N_{opt} < 1/(1 - \alpha) + O(1/\sqrt{U}) \qquad (14)$$
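The choice of $M$ can also be checked numerically. The Python sketch below compares closed-form minimizers with a brute-force search over integer values of $M$. The function names are hypothetical, and the closed form used for (5) is an assumption here: it is obtained by applying to (5) the same calculus that yields (9) for (6).

```python
import math

LN2 = math.log(2)


def rhs_5(u, m, alpha):
    """Right hand side of (5); infinite where the denominator is not positive."""
    slack = 1 - LN2 / m - alpha
    return u / slack + m if slack > 0 else math.inf


def rhs_6(u, m):
    """Right hand side of (6)."""
    slack = 1 - LN2 / m
    return 2 * u / slack + m if slack > 0 else math.inf


def m_star_6(u):
    """Real-valued minimizer of (6), as in (9)."""
    return math.sqrt(2 * u * LN2) + LN2


def m_star_5(u, alpha):
    """Real-valued minimizer of (5) (assumed closed form, see lead-in)."""
    return (math.sqrt(u * LN2) + LN2) / (1 - alpha)


def best_integer_m(rhs, m_max=1000):
    """Integer M in 1..m_max with the smallest bound."""
    return min(range(1, m_max + 1), key=rhs)


u, alpha = 200.0, 0.2
m6 = best_integer_m(lambda m: rhs_6(u, m))
m5 = best_integer_m(lambda m: rhs_5(u, m, alpha))
print(m6, m_star_6(u), m5, m_star_5(u, alpha))
```

Because both bounds are convex in $M$, the best integer choice is one of the two integers next to the real-valued minimizer, which is what the search confirms.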

4  Average-Case Performance Evaluation of the New Scheme

While  a  worst-case  analysis  assures  that  the  perfor¬ 
mance  bound  is  satisfied  for  any  task  set,  it  does  not 
provide  insight  into  the  average-case  behavior  of  the 
assignment  scheme.  To  gain  insight  into  the  average- 
case  behavior  of  Algorithm  1,  we  conduct  some  simu¬ 
lation  experiments. 

Our simulations consider large task sets with
$100 \le K \le 1000$ tasks. In each experiment, we vary the value
of parameter $M$, the number of task classes. The task
periods are assumed to be uniformly distributed with
values $1 \le T_i \le 500$. The execution times of the tasks
are also taken from a uniform distribution with range
$0 < C_i \le T_i/2$. Thus, $\alpha$, the maximum load factor
of any task, is given by $\alpha = 1/2$. The performance
metric in all experiments is the number of processors
required to assign a given task set.

We compare our scheme with the online assignment
scheme by Davari and Dhall [2], NF-M. Recall that
NF-M also has linear computational complexity. The
outcome of the simulation experiments is shown in
Figure 1. Since an optimal task assignment cannot
be calculated for large task sets, we use the total load
($U = \sum_{i=1}^{K} U_i$) to obtain a lower bound for the number
of processors required. The maximum number of
task classes is set to $M$ = 10, 20, 30, respectively. Each
data point in the figure depicts the average value of
15 independently generated task sets with identical
parameters. Note that for all values of $M$, our scheme
gives superior performance over the existing one.
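The experimental setup described above can be sketched as follows. This is an illustrative reconstruction, not the authors' simulator: the function names, the use of Python's `random` module and the seeding are assumptions. It generates task sets with the stated distributions and computes the load-based lower bound on the number of processors, averaged over 15 independent sets.

```python
import math
import random


def generate_task_set(k, rng, t_max=500.0):
    """K tasks with T uniform in [1, t_max] and C uniform in (0, T/2]."""
    tasks = []
    for _ in range(k):
        t = rng.uniform(1.0, t_max)
        c = (1.0 - rng.random()) * t / 2  # 1 - random() lies in (0, 1]
        tasks.append((c, t))
    return tasks


def processor_lower_bound(tasks):
    """No assignment can use fewer than ceil(U) processors."""
    return math.ceil(sum(c / t for c, t in tasks))


def mean_lower_bound(k, runs=15, seed=0):
    """Average lower bound over independently generated task sets."""
    rng = random.Random(seed)
    bounds = [processor_lower_bound(generate_task_set(k, rng))
              for _ in range(runs)]
    return sum(bounds) / runs


for k in (100, 500, 1000):
    print(k, mean_lower_bound(k))
```

With load factors uniform on (0, 1/2], the expected total load is about K/4, so the lower bound grows roughly linearly in the number of tasks.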


Figure  1:  Task  Sets  with  a  =  0.5. 
Abstract

The construction of a real-time system on heterogeneous
hardware platforms forces one to make choices
on which programming language, operating system,
development process and application programmers
interface to use. The application (a micro-satellite)
requirements state that the system must be dependable
in a remote and harsh environment such as space. This
paper will detail the choices made and the experience
gained from living with the choices made in the development
of a micro-satellite and its associated ground
support. The emphasis is on simple solutions throughout.
The simple solution is important for the verification
and validation of the complete system.

1  Introduction 

The construction of a real-time system on heterogeneous
hardware platforms forces one to make choices
on which programming language, operating system,
development process and application programmers
interface to use. One has the option of going with
mainstream products or considering products with features
of specific relevance to the application.

The  application  (a  micro-satellite)  requirements 
state  that  the  system  must  be  dependable  in  a  re¬ 
mote  and  harsh  environment  such  as  space.  This  re¬ 
quirement  in  addition  to  the  heterogeneous  platforms 
found  in  the  space  and  the  ground  segment  of  the  mi¬ 
cro  satellite,  complicates  the  choice  of  a  development 
environment  and  target  executable  environment. 

This  paper  will  detail  the  choices  made  and  the 
experience  gained  from  living  with  the  choices  made  in 
the  development  of  a  micro-satellite  and  its  associated 
ground  support.  This  project  has  been  running  for  two 
years  and  is  entering  the  software  intensive  phase. 

Heterogeneous computing implies increased cost. It
has however been shown in [1] that a careful application
of hardware and software diversity can be cost
effective.

The  contribution  of  this  paper  is  showing  that  the 
emphasis  on  simple  solutions  throughout  provides  an 
environment  which  is  better  suited  for  developing  de¬ 
pendable  real-time  systems.  Simple  programming  lan¬ 
guages,  operating  systems,  development  lifecycles  and 
application  programmer  interfaces  all  form  part  of  the 
proposed  solution. 

The simple solution is important for the verification
and validation of the complete system. Furthermore,
the nature of the project environment causes a 50%
manpower turnover every year.* The maintenance
of the software must not take more than 20% of the
man hours available.

The paper will begin by describing the hardware
and software required for the application domain.
Each of the areas in which a decision had to be made
will be discussed in turn with all the options available,
the final choice made and the reasons for doing
so. The paper will close by reporting the experience
we have had with our approach to constructing a
heterogeneous real-time system.


2  The  application  requirements 

The application software is for the space and
ground segments of a micro satellite constellation.
The micro satellite is a 45cm cube, weighing
50kg. The strict mass and power budgets place
constraints on the flight control hardware, which has a
profound effect on the computing resources available.
The real-time system of interest runs on the space segment
and the supporting ground segment.


*  Graduate  students  are  taken  in  every  year  and  graduate 
after  two  years 
2.1  Hardware platforms

Traditional dependable hardware for use in space
is in the form of Triple Modular Redundancy (TMR).
Each similar module is of high cost due to the reliability
encased in it.

The less expensive alternative is to have heterogeneous
hardware modules to increase fault tolerance
and resistance to failure [1]. In our micro satellite the
processors are an Intel 80C188EC, an 80386SL and an
INMOS T800.

In any space based system, the ground support adds
another dimension to heterogeneity: because it is
accessible for repair and its power budget is effectively
unlimited, it does not have to be the same architecture
as the space segment.

2.2  Software  composition 

The levels of reliability required from software in
the application domain range from ultra reliability
to acceptance of an error once per 30 days. The table
in fig 1 summarizes the software functions and the
reliability required.

3  Languages 

There have been numerous reviews and comparisons
of languages suitable for real-time applications [2, 3].
In all cases it was argued that mainstream languages
such as C, Pascal, Ada, Modula-2 etc. were not suitable
for the construction of real-time systems.

From the available commercial real-time operating
systems [4] one can deduce that C and C++ are being
used extensively for real-time applications. From
our experience C is not more than a glorified assembler
and is to be avoided for dependable software. In
favor of C is that it has by far the most widely implemented
compilers across different hardware platforms.

The  following  requirements  dictated  our  language 
choice: 

1. Most of the Software Engineers are Electronic Engineers
with limited training in Software Engineering.

2.  The  project  staff  join  the  project  for  at  most  two 
years  at  a  time. 

3.  Most  of  the  flight  software  has  to  be  very  reliable. 

4.  Most  of  the  flight  software  is  to  be  maintained 
across  a  five  year  time  span. 


The  language  of  choice  for  the  space  based  reliable 
software  is  Modula-2  with  small  sections  of  Assembly 
language  in  such  cases  as  working  without  a  stack  be¬ 
fore  the  memory  integrity  has  been  determined. 

Modula-2  provides  the  following  advantages. 

1. It is more readable for people with less software
training. This aids maintenance, given
that most of our staff have limited training.

2. It is stricter than C and C++ in its syntax, which
makes it more difficult to introduce unintended errors.
The stricter syntax also aids the verification
and validation process, making it easier to
create reliable software.

3.  Compilers  are  available  for  all  our  microproces¬ 
sors. 


4  Operating  System 

The requirements placed on the Operating System
are different for the space and ground segments.
The space segment hardware resources are costly and
bounded by power and mass budgets. The ground
segment hardware resources are for all purposes
unbounded.

The requirements for the space based operating system
are support for process dispatching, inter process
communication, synchronization, loading and unloading
of process sets, support for interrupt handling and
time functions. The implementation of these requirements
on different processors leads to the choice between
software diversity and software homogeneity.

The  cost  effective  use  of  software  diversity  is  ex¬ 
plained  in  detail  in  [1].  It  amounts  to  using  soft¬ 
ware  diversity  on  the  kernel  level  where  it  can  be  af¬ 
forded  and  no  software  diversity  on  the  application 
level,  where  it  is  expensive.  Due  to  the  fact  that  our 
software  on  the  kernel  level  is  reloadable  in  the  final 
space  segment  component,  it  was  decided  in  the  in¬ 
terest  of  development  time  to  choose  one  operating 
system  kernel  for  the  current  software  support. 

The  space  segment  processors  require  an  efficient 
Operating  System  Kernel  in  order  to  make  best  use 
of  the  resources.  There  are  many  such  kernels  on  the 
market  [4],  which  all  support  priority  based  schedul¬ 
ing. 

Deadline  driven  scheduling  is  optimal  when  com¬ 
pared  with  rate  monotonic  [5],  and  deadline  driven 
scheduling  specifies  end  to  end  deadlines  more  suc¬ 
cinctly. 
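The difference between the two policies can be illustrated with a small discrete-time simulation. The sketch below is not part of the RTXN kernel or of the paper; the function `simulate`, the policy labels and the two-task example are assumptions chosen for illustration. It uses the well-known task set (C, T) = (2, 5) and (4, 7), whose utilization of 34/35 is below 1 but above the rate monotonic bound for two tasks.

```python
from math import lcm


def simulate(tasks, policy):
    """Simulate preemptive scheduling of periodic tasks in unit time slots.

    tasks  : list of (C, T) with positive integers, deadline = period
    policy : "rm" (shortest period first) or "edf" (earliest deadline first)
    Returns True if no deadline is missed over one hyperperiod.
    """
    horizon = lcm(*(t for _, t in tasks))
    remaining = [0] * len(tasks)  # work left in the current job of each task
    deadline = [0] * len(tasks)   # absolute deadline of the current job
    for now in range(horizon):
        for i, (c, t) in enumerate(tasks):
            if now % t == 0:
                if remaining[i] > 0:
                    return False  # previous job missed its deadline
                remaining[i] = c
                deadline[i] = now + t
        ready = [i for i in range(len(tasks)) if remaining[i] > 0]
        if ready:
            if policy == "rm":
                chosen = min(ready, key=lambda i: tasks[i][1])
            else:
                chosen = min(ready, key=lambda i: deadline[i])
            remaining[chosen] -= 1
    return all(r == 0 for r in remaining)


task_set = [(2, 5), (4, 7)]
print("rate monotonic feasible:", simulate(task_set, "rm"))
print("deadline driven feasible:", simulate(task_set, "edf"))
```

Under rate monotonic priorities the second task still has one unit of work left when its first deadline arrives at time 7, whereas the deadline driven policy meets every deadline over the hyperperiod of 35 time units.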
Software component              Hardware support     Reliability required

Boot loader                     Fusible link PROM    No errors
Default application             EPROM                No errors
Household and Integrity Tasks   SRAM with EDAC       One error per 30 days
Operating System Kernel         SRAM with EDAC       One error per 30 days
Device drivers                  SRAM with EDAC       One error per 30 days
Fine ADCS Application           SRAM                 One error per 30 minutes
Bulletin board Application      SRAM with EDAC       One error per 30 days
Experimental Software           SRAM with EDAC       One error per orbit

Figure 1: The software functions and the reliability required


It  was  decided  to  go  with  a  simple  kernel  supporting 
deadline  driven  scheduling  [6].  The  specific  paradigm 
was  proposed  by  [7]  and  we  are  using  an  implementa¬ 
tion  done  by  [8],  calling  it  RTXN. 

The requirement for the uploading and unloading
of application task sets is to be supported through
the concept of Virtual Machines (VM). Each VM will
represent  a  hardware  resource  or  percentage  of  hard¬ 
ware  resource.  The  executing  code  of  a  VM  can  be 
completely  replaced  without  affecting  any  other  VM. 
Multiple  VMs  are  to  run  on  each  processor  in  the  space 
segment  with  each  VM  representing  a  percentage  of 
the  underlying  resource  available.  Within  each  VM 
the  RTXN  kernel  will  be  executed  on  which  the  ap¬ 
plication  task  set  can  be  executed  with  real-time  con¬ 
straints  guaranteed. 

The  ground  segment  processors  are  in  such  abun¬ 
dance  that  it  was  decided  to  support  the  resource  ad¬ 
equate  paradigm  [9].  Each  application  on  the  ground 
would  run  on  its  own  processor  (80x86  PC).  The 
penalty  to  pay  for  this  approach  is  a  communication 
mechanism  which  must  be  maintained  between  pro¬ 
cessors. 


5  Application  Programmers  Interface 

In order to provide access to a unified architecture
across space and ground segments, an Application Programmers
Interface (API) is required. This API is
called the SUNSAT-API (SSAPI) and is defined as the
simplest subset of functions required for the parallel
execution of processes in a Hard Real-Time environment.

The  SSAPI  supports  only  those  functions  as  men¬ 
tioned  in  section  4  and  is  described  in  detail  in  [10]. 
The  same  SSAPI  is  also  supported  on  the  ground  seg¬ 
ment,  which  enables  the  Application  Software  Engi¬ 


neer  to  concentrate  on  the  task  and  not  the  idiosyn¬ 
crasies  of  the  different  kernels  on  the  different  proces¬ 
sors. 

6  Software  development  processes 

The  diverse  range  of  reliability  requirements  on  the 
software  necessitates  a  flexible  approach  to  software 
development.  We  have  opted  to  support  the  water¬ 
fall  development  model  with  the  extension  of  rapid 
prototyping  before  the  software  requirements  phase  is 
complete. 

The  flexibility  is  built  into  the  waterfall  model  by 
requiring  the  outputs  of  the  different  stages  to  be 
checked only for specific types of software. See the table in fig 2
for the types of software and the associated checks
which must be performed on the output of each stage.

There are strict standards in place for the following
outputs:

1.  Software  requirements  document. 

2.  Software  design  document. 

3.  Software  testing  document. 

4.  Software  coding  standards. 

In addition the design languages are pinned down to
enable faster inter-engineer communication. The design
languages can be split in two groups, i.e. structural
and behavioral. The table in fig 3 lists the options
available under each of the groups.

The structural design languages are aimed at arriving
at reusable software and specify the way in which
the software is to be packaged. The behavioral design
languages specify how the software is going to behave
in multiple dimensions, which include the state, the
execution flow and the timing characteristics.
Development phases: Requirements, Concept design, Detail design, Coding, Install

Type of software output             Phases checked

Hardware debugging                  * * *
Subsystem testbed                   * * *
Hardware demonstration software     * *
Flight software                     * * * *
Ground station software             * * *
Porting an existing software        * *

Figure 2: The application of the Waterfall development lifecycle to various types of software


Structural design languages                      Behavioral design languages

Modular decomposition                            Flowcharts or pseudo code
HOOD (Hierarchical Object Oriented Design)       Statecharts
Software topology                                SSAPI dataflow diagrams
Hardware topology

Figure 3: Structural and Behavioral Design languages


7  Results 

The results to date are promising. We have progressed
one year with the initial testing of the methodology
on small pilot projects within the major project
and found the following:

1. The choice of Modula-2 (and Pascal substituted
in some cases) as programming language has in
fact enabled the new intake of engineers to get up
to speed in a shorter period of time.

2.  The  boot  loader  code  has  gone  through  the  re¬ 
quirements,  conceptual  and  detail  design  phases, 
with  formal  checks  on  the  output  of  each  phase 
and  we  have  a  greater  confidence  in  its  correctness 
than  any  other  piece  of  software  on  the  project. 

3.  We  have  found  that  the  proper  execution  of  the 
requirements  and  conceptual  design  phases  did 
take  about  70%  of  the  time,  but  that  the  coding 
was  in  fact  a  formality. 

4.  For  non-ultra  reliable  software  we  found  that  it 
is  adequate  to  proceed  with  the  design  until  the 
module  interface  level.  The  additional  time  spent 
on  detail  design  did  not  provide  any  additional 
benefits  in  saving  time  for  the  reliability  required. 

5.  The  SSAPI  is  proving  to  be  of  great  benefit  as 
more  than  one  person  is  working  on  the  software 
for  the  same  processor.  The  common  design  lan¬ 
guage  has  definitely  saved  on  man  hours  support¬ 
ing  the  execution  environment. 


6.  The  resource  adequate  strategy  for  the  ground 
station  is  proving  that  every  single  application 
developer  can  go  ahead  regardless  of  any  other 
person.  This  facilitates  true  concurrent  engineer¬ 
ing  and  simplifies  the  checking  for  adhering  to 
hard  real-time  requirements. 

7. The choice of going with a simple, non-commercial
kernel has not proved us wrong yet.
The simplicity ensures that one person can understand
the complete kernel which ensures complete
transparency through the kernel when implementing
on a specific hardware platform.


8  Conclusion 

We  have  had  to  make  many  choices  for  implement¬ 
ing  a  Real-Time  System  on  a  heterogeneous  platform. 
Neither  the  programming  language  nor  the  operating 
system  kernel  is  mainstream,  but  we  believe  that  the 
penalty paid for it (lack of support) will be repaid in
ease of maintenance in the long run.

We have finished our initial investigation into the
suitability of the paradigm explained and we are
producing software for our first integration of the
Engineering model based on the paradigm selected. On
completion, the paradigm will be evaluated before the
actual flight software is produced in the latter part of
1994 and the beginning of 1995.

Abstract

In this paper, we present an efficient method using
the Specification and Description Language (SDL) for
designing and implementing real-time embedded systems.
We also discuss the implementation of a companion
kernel for the SDL based Design Tool (SDT)* CASE tool
environment, which generates multi-tasking application
software for the pSOS^ real-time OS by applying defined
mapping and translation rules. Since SDL is a formal
specification and description language, with the support
of the SDT CASE environment the major part of a system
can be analyzed, simulated, verified, and validated at
early stages of system development. The concurrency due
to the multiple concurrent state machines in a system is
preserved in the target run-time environment. Because a
system described in SDL uses message passing, a
distributed version is relatively easy to derive from it.
To show that the proposed method using SDL does not
dramatically compromise memory and response time
(speed), we present results obtained from a pSOS
implementation of an AccessControl system. We also
outline some areas that we are continuing to work on.


1  Introduction 

1.1  Evolving Software Development
Methods  for  Embedded  Systems 

Software system design and development has long
been regarded as an art rather than a science or an
engineering discipline. The term Software Engineering
was formally coined at a NATO conference in 1968, which
signalled that systematic, disciplined engineering
methods and techniques should be employed to produce
quality, cost-effective software systems.

*SDT is a trademark of TeleLOGIC AB.

^pSOS is a trademark of Integrated Systems Inc.

Significant improvements have been made since
then in many fields, allowing software systems of
better quality to be produced in a controlled and
cost-effective manner, thanks to research efforts in
software engineering, methodology, and computer-aided
software engineering (CASE) environments. Many fairly
sophisticated and large CASE tools exist, such as
Teamwork, Statemate, and ER-Designer [4], which assist
system designers and developers in continually
improving the quality and productivity of documentation
and code.

The  traditional  real-time  embedded  systems  devel¬ 
opment  methods  emphasize  memory  and  speed  by 
handcoding  the  software  systems  in  assembly  lan¬ 
guages. 

With the increasing performance of processors and
hardware devices, memory and speed have become less
important. Advances in hardware technology have thus
led to the widespread use of mixed development of
software systems. The majority of a software system is
developed on the host development environment in a
high-level programming language, such as C or C++.
Assembly code is still needed for certain time-critical
tasks, device drivers, and interrupt routines.

Object-Oriented techniques and methodologies further
improve real-time embedded systems design and
implementation in terms of software reusability,
quality, documentation, and classification [5].
Object-Oriented CASE tool environments have played a
major role in this new trend.

1.2  Current Methods with CASE Tool
Support  and  Our  Approach 

The following steps are not uncommon in current
embedded systems design and implementation:

Step  1  :  Requirement  analysis 

Step  2  :  Specification  and  design 

Step  3  :  Implementation

•  Host  simulation  and  debugging 

•  Target  simulation  and  debugging 

Step  4  :  Testing  (and  may  iteratively  goto  Step  }) 

There are a number of problems with the CASE tools
in use today, such as informality, non-analyzability,
and non-validatability, to name a few.

Aside from the confusion caused by the different
notations of those methods, most of them are based on
proprietary formalisms and lack the mathematical
foundation needed for system consistency analysis,
verification, and validation.

There also tends to be a gap between design and
implementation. Design is automated to a certain
degree, but implementation is conducted separately and
often manually. The implication is that the correctness
of the implementation is not ensured, due to the
discontinuity between the design and the implementation.

The next factor that hinders software prototyping
and implementation for embedded systems is coding and
debugging. Even with increasingly sophisticated host
development environments, developers are often
overwhelmed by coding at the API level (in C or C++).
Debugging usually occurs only after the cross-compiled
system has been downloaded onto the target, although
host simulation and debugging are sometimes provided.
The usual nightmare is then that significant time is
spent on deep-level debugging.

Our primary focus is to emphasize the suitability
of SDL for embedded systems specification and design.
Secondly, we try to bridge the gap between design and
implementation by mapping out rules and algorithms to
automatically generate the code for the final target
environment.

SDL is a well-known specification and description
language standardized by ITU* as Z.100 [1] [2]. Its
success and popularity are still growing, especially
in the telecom and datacom sectors, and particularly in
Europe. Moreover, this proven technology for systems
specification and design has been evolving gradually.
SDL92 is more object-oriented.

*ITU stands for International Telecommunication Union,
previously called CCITT.

The methodology for using SDL and SDT in designing
and implementing software systems for real-time
embedded systems is similar to that for telecom
software systems, as outlined below:

Step  1  :  Requirement  analysis  (Message  Sequence 
Chart) 

Step  2  :  System  Specification  (SDL) 

Step  3  :  Semantic  and  dynamic  system  analysis

Step  4  :  Simulation,  verification,  and  validation

Step  5  :  Code  generation

Step  6  :  Testing  (conformance  testing)

Step 1 through Step 6 are supported by SDT.
Various SDL techniques have been studied and applied
successfully, and a broad literature can be found
[8] [7] [1] [?]. We will mainly discuss code generation
for embedded systems, since software for embedded
systems has some unique characteristics that are often
not required by other types of applications, namely
time-criticality, response time, and fast I/O
processing.

To  meet  such  kinds  of  operational  requirements, 
two  types  of  philosophy  are  instrumental  in  the  de¬ 
sign  process: 

•  Build  software  system  on  top  of  a  commercial 
real-time  operating  system 

•  Build software system along with a self-developed
kernel

Depending upon the application and the desired
embedded system configuration, these two approaches are
used alternately or interleaved. For instance, the
front end agent of a distributed embedded system does
not have a real-time OS, but some large nodes and host
nodes employ real-time OSs.

When building around an existing real-time OS, we
describe a model and use SDT to automatically generate
target code that fully utilizes the scheduling, memory
management, and time management capabilities of the
real-time OS. For the latter approach, we augment the
model by adding scheduling algorithms to the generic
SDT kernel to fit a particular application.

Critics have questioned automatic code generation
in terms of the quality of the code, the size of the
code, and the speed of the resulting systems with
regard to response time to critical device requests and
interrupts. We have conducted some initial analyses
based on an AccessControl system generated for an M68K
system running pSOS.

The following are the major highlights from our
experience:

•  Code size overhead: 1.3-1.5:1

•  Code quality: No debugging effort for the majority
of the system, except for the interrupt, device driver,
and interface parts during the final integration.

Systems maintenance and reusability are much easier
now because we are dealing with most of the software
system at a high level, that of specification and
design.

Research  efforts  are  ongoing  on  the  use  of  SDL  for
system  specification  and  design.  Some  researchers 
have  proposed  notations  and  supporting  systems  try¬ 
ing  to  bridge  the  gap  between  requirement  analysis 
and  specification.  To  a  certain  degree,  the  design  spec¬ 
ification  can  be  automatically  produced  from  require¬ 
ment  specification  [9]. 

This method, along with the tool support, provides
us with better system maintainability, traceability,
and other benefits, such as testability.


2  SDL  based  Approach  Overview 
2.1  Overview 

The SDL based software modelling technique is not
new. The language was standardized in 1988 as Z.100
and it is still evolving. The current SDL92 is a
superset of SDL with object-orientation extensions, and
is also called object-oriented SDL.

SDL was created for modelling large and complex
distributed real-time telecommunication software
systems. Substantial experience has been gained in
using SDL for system specification and validation. In
many cases, real implementation code was generated and
deployed.

Many methodologies based on SDL have been proposed
and employed [1] [7] [6] for systems specification and
design. We will omit the mature and common set of
procedures and stress those issues specifically related
to embedded systems, such as target code generation and
systems integration.

The  following  steps  are  suggested  for  modelling  and 
implementing  an  embedded  system: 


Step 1: Define the system architecture (configuration
of hardware and software);

Step  2:  Define  the  interface  between  the  software  sys¬ 
tem  and  hardware  system  in  terms  of  interrupts 
and  device  drivers; 

Step 3: Model the software system using SDL, and the
hardware interface in the environment using either
SDL or Unix processes as separate systems;
(Note: the hardware device drivers and interrupt
routines in assembly level language can be developed
simultaneously.)

Step 4: Use the SDL modelling techniques to prototype,
simulate, validate, and test (conformance test) the
system under design;

Step 5: Generate target code when Step 1 through
Step 4 are completed, by compiling and linking the
proper shared library, i.e. the library for the target
real-time OS or for operation without a target
real-time OS;

(Note: test suites can also be generated automatically
for unit or system test.)

Step 6: Use a proper target debugging environment to
test the integrated embedded system on the target
systems; and

Step  7:  Depending  on  necessity,  perform  some  man¬ 
ual  code  optimization  for  certain  critical  tasks. 

In this suggested method, Step 1 through Step 4
are relatively well understood and deployed. Step 5
through Step 7 are of interest to us here in this paper.

2.2  Tool  configuration  for  target  code 
generation 

In SDL, system behavior is described by state
machines, each representing an SDL process.
Communication among the blocks and processes is
realized by signalling (i.e. sending and receiving
messages). Time constraints are imposed by setting
timers. The code generated from an SDL design
implements a complex finite state machine (FSM)
comprising many concurrent FSMs communicating via signals.
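
As a rough illustration of this execution model, the following Python sketch runs two concurrent state machines that exchange signals through per-process input queues. It is not SDT output: the Process class, the run loop, and the reader and door processes with their signal names are all hypothetical, and timers are left out for brevity.

```python
from collections import deque


class Process:
    """One SDL-style process: a state machine with its own input queue."""

    def __init__(self, name, start_state, transitions):
        self.name = name
        self.state = start_state
        # transitions: {(state, signal): (next_state, [(receiver, signal), ...])}
        self.transitions = transitions
        self.queue = deque()

    def step(self, system):
        """Consume one queued signal and fire at most one transition."""
        if not self.queue:
            return False
        signal = self.queue.popleft()
        key = (self.state, signal)
        if key in self.transitions:
            self.state, outputs = self.transitions[key]
            for receiver, out_signal in outputs:
                system[receiver].queue.append(out_signal)
        # A signal with no matching transition is simply discarded.
        return True


def run(system, max_rounds=100):
    """Give every process one step per round until all queues are empty."""
    for _ in range(max_rounds):
        progressed = [p.step(system) for p in system.values()]
        if not any(progressed):
            break


# Hypothetical two-process system, loosely inspired by an access-control door.
system = {
    "reader": Process("reader", "idle", {
        ("idle", "card"): ("waiting", [("door", "open")]),
        ("waiting", "opened"): ("idle", []),
    }),
    "door": Process("door", "closed", {
        ("closed", "open"): ("open", [("reader", "opened")]),
    }),
}
system["reader"].queue.append("card")  # signal arriving from the environment
run(system)
```

Each process only ever touches its own state and the input queues of its peers, which is what allows the concurrency of the design to be kept when the processes are later mapped onto separate tasks.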

Figure  1  shows  the  SDT  tool  configuration  with  ex¬ 
tension  for  generating  pSOS  target  code. 

For  many  embedded  systems,  there  are  some  special 
issues,  such  as  concurrency  (multi-tasking),  synchro¬ 
nization,  fast  multiple  I/O  processing,  and  interrupt 
handling. 

SDT has a generic kernel for both the host
environment (simulation, validation, and verification)
and the target environment, with some minor patching in
clock and memory management handling. The SDT
scheduler is priority oriented with granularity down to
the state transition. The synchronization among
processes is realized through signalling.

Figure 1: SDT tool configuration with extension for
generating pSOS real-time OS based software systems:
dashed-lined component

For embedded systems without a real-time OS, a
generic SDT kernel is provided by the SDT CASE tool.
The code generated from the SDL design can run with the
SDT kernel. If there are special domain requirements
on the OS, the current SDT scheduling algorithm can
be modified, for example to add time-slicing and
pre-emption.

When generating code for embedded systems with
real-time OS support, such as pSOS, the system designed
in SDL is transformed into a software system based on
the designated target real-time OS. The mapping rules
and transformation algorithms are maintained in the
pSOSlib. The current implementation maps SDL processes
to real-time OS tasks and signals to messages.

3  The  Mapping  Model  for  Generating 
Real-Time  OS  based  Code 

The  design  for  embedded  systems  without  real-time 
OS  support  has  been  briefly  discussed  in  the  previous 
section.  We  focus  on  the  mapping  rules  and  trans¬ 
lation  algorithms  to  generate  code  for  pSOS  target 
environment  here  *. 

*Note: in the discussion that follows, although pSOS is used, the
mapping rules are similar when applying them to other real-time


3.1  SDL  Systems  Semantics  and  Its  Exe¬ 
cution  Model 

For continuity, we only revisit the definition and
key aspects of the model and its semantics. Refer to
[1] and [8] for a detailed description. SDL is based on
the concept of a system of communicating extended
finite-state machines (CEFSMs) that communicate with
one another and with their common environment by
signals, in an asynchronous manner, via possibly
delaying communication paths. These signals are
buffered on arrival at a process.

An SDL specification is represented by an execution
model comprising seven meta-process types. These
processes are Communicating Sequential Processes
(CSP), and thus they communicate using synchronous
events. These meta-processes are:

(1) system: The only instance of this meta-process type
creates all other meta-processes and maintains them;

(2) delaying-path: One instance of this meta-process
type exists for each communication path between
processes of the SDL system;

(3) global-time: This entity knows the current global
time;

(4) view: The only instance of this meta-process type
keeps track of all viewed variables;

(5) sdl-process: An instance of this meta-process type
represents a process;

(6) input-queue: An instance of this meta-process type
represents the input queue of a process;

(7) timer: An instance of this meta-process type
represents a timer of a process.

3.2  Mapping  Rules  and  Transformation
Algorithms 

When formulating mapping rules and transformation
algorithms to transform an SDL system onto a real-time
OS system, such as pSOS, the rules to follow are:
(1) preserve the SDL dynamic semantics; (2) identify
RTOS objects to realize SDL objects; (3) implement
additional software to augment the portions that the
RTOS does not support or does not match well.

In our case, we map an SDL process to a pSOS task,
and an input queue to a pSOS message queue. Due to some
mismatch between the pSOS time management services
and SDL's, we implemented our own timer management.
When some improvements have been made to the pSOS time
services, we intend to use them for a tighter
integration, which will reduce code size and fully
utilize the pSOS kernel features. To support the richer
SDL communication properties, such as sending a signal
without specifying the receiver PID, we implemented a
relatively concise system.

OS  environments  such  as  VxWorks. 


The view meta-process was not implemented,
considering its side-effects on code modularity and
reusability.
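
The process-to-task and queue-to-message-queue mapping can be pictured with the following sketch. It deliberately avoids the pSOS API: a Python thread stands in for an RTOS task and queue.Queue for an RTOS message queue, and the Task class, the STOP signal, and the handler are hypothetical names chosen for illustration only.

```python
import queue
import threading


class Task(threading.Thread):
    """Stand-in for an RTOS task that realizes one SDL process."""

    def __init__(self, name, handler):
        super().__init__(name=name, daemon=True)
        self.inbox = queue.Queue()  # stand-in for the task's message queue
        self.handler = handler      # transition code of the SDL process
        self.log = []

    def run(self):
        while True:
            # Block until a signal arrives, as a task would on a queue receive.
            sender, signal = self.inbox.get()
            if signal == "STOP":  # hypothetical shutdown signal for the demo
                break
            self.handler(self, sender, signal)


def record(task, sender, signal):
    """Trivial transition body: remember which signal came from whom."""
    task.log.append((sender, signal))


controller = Task("controller", record)
controller.start()
controller.inbox.put(("reader", "card"))  # an SDL signal becomes a message
controller.inbox.put(("reader", "STOP"))
controller.join(timeout=2)
```

Because each task blocks on its own queue, scheduling and synchronization are left to the underlying OS rather than to a generated scheduler.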

3.3  Optimization  Issues

To optimize performance, we deliberately require
the SDL user to specify the receiving process PID.
The idea is that no routing function then needs to be
performed to search for the receiver in an embedded
environment. However, we did implement the full SDL
routing semantics for those who wish to use them.
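
The trade-off can be sketched as follows. The Router class and its fields are hypothetical and are not taken from pSOSlib; choosing the first candidate receiver is a simplification made here, not a statement about the SDL semantics or the actual implementation.

```python
class Router:
    """Toy signal router contrasting direct sends with receiver lookup."""

    def __init__(self):
        self.queues = {}   # pid -> input queue (a plain list here)
        self.accepts = {}  # signal name -> pids able to receive it
        self.lookups = 0   # number of routing searches performed

    def register(self, pid, signals):
        self.queues[pid] = []
        for name in signals:
            self.accepts.setdefault(name, []).append(pid)

    def send(self, signal, to=None):
        """Deliver a signal; search for a receiver only if none is given."""
        if to is None:
            self.lookups += 1
            candidates = self.accepts.get(signal, [])
            if not candidates:
                return None     # no receiver known: dropped in this sketch
            to = candidates[0]  # simplification: take the first candidate
        self.queues[to].append(signal)
        return to


router = Router()
router.register(1, ["open"])
router.register(2, ["card"])
direct = router.send("open", to=1)  # explicit PID: no search needed
routed = router.send("card")        # no PID: one routing search
```

With an explicit PID the send reduces to a single queue append, which is the saving that requiring the receiver PID is meant to obtain.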

4  Experiment  Results  and  Analysis 

We have used a classic example from the SDL
literature [8] [6], the AccessControl system, to perform
some preliminary performance analysis. The following
are the major points to be discussed:

•  Outline  of  system  requirements 

•  System  description  in  SDL 

•  Measurements  of  generated  pSOS  application 
code 

•  Runtime  performance  data  analysis,  such  as  sys¬ 
tem  call  frequency,  input  queue  length,  and  re¬ 
sponse  time 

At the end of this section, some comparative data
analysis is expected, and a performance analysis of the
generated system with and without the routing module is
given. We also report how the performance was improved
by introducing a combined static-dynamic mechanism that
optimizes performance while maintaining the SDL
semantics.

5  Conclusion  and  Future  Work 

We believe that the method using SDL with the
companion tool will, to a large extent, improve the
quality, lead time, and cost of engineering embedded
systems.

However, there are other issues and areas to be
studied further in this direction. The following list
is intended to shed some light on what we feel should
be the focus:

•  Automatically generating distributed applications
for multi-processor architectures;

•  Process-task grouping rules and realization;

•  Code optimization.

Abstract

The complexities of real-time systems are such that
it is often thought necessary to give a formal
justification of their correctness, especially if they
are to be used in a safety-critical environment. In this
paper we describe our work on a formally based design
method for real-time systems which allows the timing
aspects of a concurrent system to be mathematically
described and verified, as well as semi-automatically
implemented.  Our  design  language,  AORTA,  is  a 
timed  process  algebra,  with  features  to  ensure  that  all 
designs  can  be  implemented.  A  predictable  real-time 
kernel  is  also  described,  which  is  used  in  the  construc¬ 
tion  of  a  system  from  an  AORTA  design,  and  which 
allows  the  timing  of  the  implementation  to  be  verified. 

1  Background  and  motivation 

There  is  much  existing  work  on  methods  for  real¬ 
time  systems,  both  on  the  theoretical  aspects  of  veri¬ 
fying  the  correctness  of  a  real-time  system  [1],  and  on 
the  practical  ways  of  guaranteeing  performance  via  a 
real-time kernel [2], but little that links the two. If
practical formal techniques are to be found for
real-time systems, then both the high-level, more
theoretical aspects and implementation performance
issues such as scheduling must be considered. There is some
work  which  attempts  to  link  the  higher  and  the  lower 
level,  such  as  the  implementation  of  formal  models  by 
compilation  of  (untimed)  LOTOS  [3,  4],  or  the  over¬ 
all  system  design  methodology  of  the  (non-formally 
based) MARS project [5], but we are not aware of any
work that addresses the practical design,
implementation, and formal verification of a
time-critical system. In response to this apparent
lack, we have developed a formal design language based
on process algebra called AORTA (Application Oriented
Real-Time Algebra) [6], with the specific aim of
providing a complete route from a (timed) formal
specification to a verified implementation.

The  first  goal  of  our  project  has  been  to  provide  a 
means  of  producing  a  real-time  system  from  its  formal 
expression  in  AORTA,  as  most  of  the  existing  theoreti¬ 
cal  work  covers  the  verification  of  a  design  with  respect 
to  a  formal  specification.  Work  on  real-time  kernels 
has  made  advances  in  ensuring  that  processes  will  get 
through  their  work  as  quickly  as  possible.  Sometimes, 
however,  this  is  at  the  expense  of  making  the  schedul¬ 
ing  arrangements  too  complex  to  be  able  easily  to 
provide  reliable  predictions  about  the  performance  of 
an  interacting  set  of  processes.  Whilst  priority-based 
scheduling  algorithms  may  be  provably  optimal,  it  is 
not  always  optimality  that  is  important  —  in  particu¬ 
lar,  predictability  of  time-critical  systems  can  be  cru¬ 
cial.  On  this  basis  we  have  reverted  to  a  very  simple 
yet  predictable  fixed  time-slice  round-robin  scheduler, 
so  that  timing  is  easier  to  predict,  as  the  performance 
of  each  process  does  not  depend  on  the  performance  of 
others  except  at  explicit  communication  or  synchroni¬ 
sation  points.  The  efficiency  sacrificed  in  using  such 
a  scheduling  mechanism  is  balanced  with  the  reduced 
cost  of  developing  a  verified  system;  as  hardware  costs 
are  relatively  low  compared  with  development  costs, 
we  feel  that  this  tradeoff  is  often  justified. 

The  kernel  also  provides  sound,  safe  and  predictable 
communication  primitives,  based  on  Ada  style  syn¬ 
chronous  communication,  which  correspond  directly 
to  the  communication  constructs  in  the  AORTA  de¬ 
sign  language.  Together  with  a  timeout  facility,  this 
provides  a  direct  route  to  implementation,  by  C  code 
generation  from  the  AORTA  design.  Although  the 
AORTA  design  only  deals  with  the  timing  and  inter¬ 
communication  of  the  processes,  the  sequential  code 
within  a  process  is  included  in  a  manageable  way, 
and  the  timing  of  non-generated  code  is  verified  by 
a  combination  of  bounds  on  processing  time  of  the 
code  and  the  processing  distribution  figures  available 
for  the  kernel  in  [7].

2  The  AORTA  design  language

Timed  process  algebras  are  widely  known,  but  are 
usually  used  for  modelling  or  specifying  real-time  sys¬ 
tems,  rather  than  designing  them.  Our  language, 
whilst  having  some  features  in  common  with  timed 
(and  untimed)  process  algebras,  is  distinguished  by 
its  implementability.  There  are  two  main  points  of  dif¬ 
ference,  the  first  being  that  the  number  of  processes 
within  a  system  is  constant  throughout  the  lifetime 
of  the  system,  making  processor  allocation,  and  hence 
computation  times,  easier  to  verify.  Secondly,  all  tim¬ 
ings  in  the  design  can  be  expressed  as  upper  and  lower 
bounds,  rather  than  exact  figures,  as  in  reality  bounds 
can  usually  be  given,  where  precise  figures  may  not  ex¬ 
ist.  It  is  this  representation  of  time  bounds  which  is 
most  problematic  in  existing  timed  process  algebras. 

In  order  to  keep  the  model  of  the  timing  tractable, 
data  within  a  system  is  not  represented  in  AORTA, 
with  all  computation  being  represented  only  by  the 
(bounds  on  the)  amount  of  time  required  to  complete 
it.  Within  each  process  communication  is  represented 
by  the  name  of  the  gate  on  which  communication  is 
to  take  place,  and  bounds  are  placed  on  the  amount 
of  time  between  both  sides  of  the  communication  be¬ 
ing  ready,  to  the  communication  actually  taking  place. 
Processes  are  written  in  a  simple  equational  format, 
similar  to  that  used  in  many  other  process  algebras. 
A  process  written 

Convert = in.[100,150]out.Convert

will  wait  for  communication  on  its  in  gate,  before  do¬ 
ing  some  computation  which  lasts  for  between  100  and 
150  milliseconds  (any  time  units,  discrete  or  dense  may 
be  chosen  for  a  design  —  [0.1,0.15]  would  be  an  equally 
valid  expression)  and  offering  communication  on  its 
out gate. Once this second communication has taken
place the process will start again waiting for an in
communication. This is how a process which accepted
temperature  data  and  converted  it  to  a  different  for¬ 
mat  would  be  expressed.  The  bounds  on  communi¬ 
cation  times  are  given  by  a  separate  function  which 
takes  a  gate  identifier  and  returns  a  time  interval. 
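
As a worked example of how such bounds combine, the sketch below adds the intervals of one Convert cycle. The gate delay values are assumed for illustration (the source does not give them), and the sum is only valid under the further assumption that the communication partner is already waiting whenever Convert offers a gate.

```python
def seq(*intervals):
    """Bounds on a sequence of steps: add the lower and the upper bounds."""
    lower = sum(lo for lo, _ in intervals)
    upper = sum(hi for _, hi in intervals)
    return (lower, upper)


# Hypothetical communication bounds per gate, in milliseconds.
GATE_DELAY = {"in": (0, 10), "out": (0, 10)}
COMPUTE = (100, 150)  # the computation delay of Convert

# One pass through Convert: in, then the computation, then out.
cycle = seq(GATE_DELAY["in"], COMPUTE, GATE_DELAY["out"])
```

Under these assumptions one cycle takes between 100 and 170 milliseconds; any time spent waiting for a partner that is not yet ready would have to be added on top.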

A  choice  construct  is  provided,  similar  to  the  +  of 
CCS  and  the  select  statement  of  Ada,  which  allows 
several  possible  communications  to  be  offered.  The 
future  behaviour  of  the  process  depends  on  which  is 
completed  first.  Timeouts  may  also  be  defined,  so 
that  if  none  of  the  communication  choices  offered  are 
taken  up  within  a  certain  time,  then  control  passes  to 
another  branch.  Again,  exact  figures  are  not  usually 
available  for  occurrences  of  timeouts,  so  bounds  are 


used  instead.  If  our  Convert  process  is  to  accept  input 
or  allow  its  conversion  mode  to  be  changed,  this  would 
be  written 

Convert = in.[100,150]out.Convert
          +
          mode.[300,400]Convert

where  the  reconfiguration  procedure  takes  between 
300  and  400  milliseconds.  If  the  data  offered  on  the 
out  gates  is  also  to  be  kept  up  to  date  then  it  may 
need  to  be  refreshed  every  1.5  seconds  or  so,  which  is 
achieved  by  adding  a  timeout  to  the  out  communica¬ 
tion: 

Convert = in.[100,150]
            (out.Convert)[1450,1550>Convert
          +
          mode.[300,400]Convert

where  1450  and  1550  are  estimates  on  the  bounds 
which  can  be  placed  on  a  timeout  of  about  1500  mil¬ 
liseconds. 

The  last  construct  which  can  be  used  in  the  defini¬ 
tion  of  individual  processes  is  a  data-dependent  choice, 
used  where  the  flow  of  control  of  the  process  depends 
on  the  value  of  some  data  in  the  system.  Data  is  not 
modelled  in  AORTA,  so  this  is  essentially  a  nondeter- 
ministic  choice  as  far  as  the  process  algebra  is  con¬ 
cerned.  The  choice  between  two  possible  behaviours 
is  represented  by  ++,  so  that  if  our  Convert  process  is 
to  give  a  warning  if  the  value  that  it  finds  is  outside  a 
certain  range,  this  would  be  written 

Convert = in.(Convert2 ++ warning.Convert2)
          +
          mode.[300,400]Convert
Convert2 = [100,150]
            (out.Convert)[1450,1550>Convert

A system usually consists of the parallel
composition of two or more processes; this is
represented using the traditional process algebra bar |,
with a connection set showing pairs of gates which may
communicate. This explicit connection of gates allows
for a more efficient implementation, and simplifies
verification. The connection set is represented by
pairs of gate identifiers written in angle brackets
after the processes of the system. A plant control
system incorporating the Convert process with a Control
process and a Datalogger process is written as follows:

Tempsys =
  ( Control | Convert | Datalogger )
  < (Control.changem, Convert.mode),
    (Control.twphlgb, Convert.warning),
    (Convert.out, Datalogger.getdata),
    connections between Control and Datalogger
  >

The ordering of the pairs of gates is not important,
and not all gates of the processes need be connected:
those left free will have to communicate with the
environment, like the in gate of our Convert process.
Figure 1 gives a diagrammatic representation of Tempsys.

Most features of typical small embedded systems
can be designed using this language: resource
contention can be handled with the choice construct,
and polling loops with a timeout. As well as allowing
implementation, AORTA has a formal semantics given
in terms of a timed transition system, which allows
formal reasoning to be done about the design, and the
possible application of model-checking techniques such
as [8, 9]. Space does not allow us to go into the details
of the semantics here, but see [6] for more details.
It remains now to show how AORTA designs can be
implemented in practice.

3  Implementation  and  the  kernel 

As the main point of the AORTA design language is
its implementability, we outline here the kernel we
have written, which allows AORTA designs to be
verifiably implemented. We mentioned earlier that we
have adopted a very simple approach to scheduling
in order to be able to verify the performance of each
process easily. This is achieved by using a
fixed time-slice round robin scheduler, where a fixed
schedule of processes is executed on the kernel at a
fixed frequency, so that each process has a guaranteed
amount of processing time per unit real time,
unaffected by the performance of the other processes. For a
given amount of processing time required, bounds can
be put on the amount of real time required, so that
the timing of a piece of sequential computation can
easily be verified given the processing requirements of
that computation. Bounds on the computation time
required for a piece of code can be found using
techniques such as those described in [10].
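
The way a fixed time-slice schedule turns a processing requirement into bounds on real time can be sketched as follows. This is only an illustration under assumptions that are not taken from the paper: every slice has the same length, each process owns exactly one slice per cycle, and kernel overhead is ignored. The function name and parameters are hypothetical.

```python
import math

def real_time_bounds(work, slice_len, n_slots):
    """Return (best, worst) real time needed to receive `work` units of
    processing, assuming the process owns one slice of length `slice_len`
    in every cycle of `n_slots` equal slices (kernel overhead ignored)."""
    if work <= 0:
        return (0, 0)
    slices_needed = math.ceil(work / slice_len)
    cycle = n_slots * slice_len
    # Best case: the computation starts exactly when the process's slice begins.
    # All but the last slice cost a full cycle; the last costs only the work left.
    best = (slices_needed - 1) * cycle + (work - (slices_needed - 1) * slice_len)
    # Worst case: it becomes ready just as its slice ends and must first
    # wait for the slices of all the other processes.
    worst = best + (n_slots - 1) * slice_len
    return (best, worst)
```

For example, with three slices of 10 time units each, a computation needing 25 units of processing is bounded between 65 and 85 units of real time, whatever the other two processes do.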

At each scheduling point, the kernel checks through
the list of connected gates to see if there is a pair
which is ready; if there is then it effects the
communication, signalling to the processes involved that it
has taken place, and disables communication on gates
which were in choice with the successful gates. It then
looks for possible external communications (on gates
that are left unconnected) before looking through the
list of timeouts to see if any have exceeded their time
limit. By checking for communications and timeouts
at every reschedule, bounds can be placed on the time
for a communication to take place once enabled, and
for a timeout to come into effect.
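
The sequence of checks made at a scheduling point can be sketched roughly as below. This is not the kernel itself (which offers its primitives as C functions); the data structures, the event tuples and the function name are assumptions made for illustration, and the sketch effects at most one internal communication per scheduling point.

```python
def scheduling_point(connections, ready, in_choice, external, timeouts, now):
    """One pass of the checks described above; returns the list of events.

    connections: list of (gate, gate) pairs fixed by the design
    ready:       set of gates on which a process is offering to communicate
    in_choice:   dict mapping a gate to the set of gates offered in the same choice
    external:    set of unconnected gates on which the environment is ready
    timeouts:    dict mapping a (hypothetical) timeout name to its expiry time
    """
    events = []

    def fire(gates):
        # Withdraw the successful gates and every gate in choice with them.
        for g in gates:
            ready.difference_update(in_choice.get(g, set()))
            ready.discard(g)

    # 1. Internal communication: first connected pair with both ends ready.
    for a, b in connections:
        if a in ready and b in ready:
            events.append(("comm", a, b))
            fire((a, b))
            break

    # 2. External communication on gates left unconnected.
    for g in sorted(external & ready):
        if g in ready:  # may have been withdrawn by an earlier choice
            events.append(("external", g))
            fire((g,))

    # 3. Timeouts whose time limit has been reached.
    for name in sorted(timeouts):
        if timeouts[name] <= now:
            events.append(("timeout", name))
            del timeouts[name]
    return events
```

Because all three checks run at every reschedule, the delay before an enabled communication or an expired timeout is noticed is at most one scheduling period, which is what makes the bounds mentioned above possible.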

Communication primitives are offered by the kernel
as C functions which are called from the processes.
The calls to the kernel are generated automatically
from the design, along with the parameters of the
kernel, such as the number of processes, and the gates
which are to be connected. Details such as the code
to be executed as part of a computation delay, the
data to be passed in the communication, and the
conditions for a data-dependent choice are attached as
annotations to the design. They have no
interpretation in the formal semantics, where they are viewed as
comments, but they allow the code to be included in
the correct place without having to edit the code
generated from the design. Putting the kernel together
with the generated code allows an implementation to
be generated automatically from an annotated design.
The pieces of sequential code still have to be hand
constructed, but once written, their timing, and hence the
timing of the whole system, can be verified.


4 Current and future work

The work on providing a route from design to
implementation is complete, so that a system can be built
automatically from its design. Although all of the
verification methods are manually available, they have not
yet been integrated into a single tool. The verification
of an AORTA design will be addressed soon, but there
is currently a simulator tool, which allows a design
(including its timing) to be tested out by a user as the
first step in a verification process. It is hoped that
existing formal verification methods (such as [8, 9])
may be applicable. Figure 2 shows how the work fits
together: arrows going downward represent
implementation, arrows upwards verification; solid arrows
indicate currently available routes, automated where
appropriate, and the dashed arrows represent possible
future pieces of work. Other implementation routes
may be the subject of future work, such as distributed
implementations or kernels based on other scheduling
mechanisms. One particularly interesting piece of
future work would be the integration of existing formal
methods for developing sequential code (such as Z [11]
or VDM [12]) with AORTA, so that the timing of a
system and its functional correctness could be verified
in a unified way.


Abstract

We report our ongoing research in real-time
communication with FDDI-based reconfigurable networks.
The original FDDI architecture was enhanced in
order to improve its fault-tolerance capability, while a
scheduling methodology, including message
assignment, bandwidth allocation, and bandwidth
management, is developed to support real-time communication.
As a result, message deadlines are guaranteed even in
the event of network faults.

1  Introduction 

Computer networks employed in mission-critical
systems must meet stringent timing requirements
which arise due to the communication between
real-time tasks executing on different network nodes. In
this paper, we report our work aimed at addressing
issues related to fault-tolerant guarantees of
synchronous message deadlines in FDDI-based networks,
i.e., the transmission of messages before their
deadlines, even in the event of network faults.

FDDI is an ANSI standard [4] for a 100 Mbits/sec
fiber optic token ring network. FDDI is a good
candidate for mission-critical real-time applications, due
not only to its high bandwidth, but also to its
property of bounded token rotation time and its dual ring
architecture. The bounded token rotation time [20]
provides a necessary condition to guarantee hard
real-time deadlines, while the dual ring architecture allows
the maintenance of continuous real-time service under
some failure conditions. Several new civilian and
military networks have adopted FDDI as the backbone
network. In particular, FDDI has been adopted by
the Survivable Adaptable Fiber Optic Embedded
Network (SAFENET) [18]. SAFENET is a military
standard for computer networks developed with the Navy’s
Next Generation Computer Resources (NGCR)
program.

Although indispensable, the bounded token
rotation time and the dual ring architecture alone are
inadequate for guaranteeing message deadlines. Several
critical issues must be addressed in order to achieve
this objective.

•  Fault-tolerance capability must be enhanced.
With the FDDI architecture, only one trunk link
fault can be tolerated. Two trunk link faults
cause the network to become disconnected.
Furthermore, under the current standard, upon the
occurrence of a fault, the detection and recovery
processes may take several seconds to complete.
This is too long to be able to satisfy message
deadlines in many hard real-time applications.

•  Message transmission must be properly scheduled.
Message scheduling includes the arbitration of
network access for each node, and the control of
the length of transmission by each node. The
deadlines of messages can be met only if access
arbitration and transmission control are performed
properly [17].

The above issues are addressed in our project. In
particular, we enhance the FDDI architecture in order
to improve its fault-tolerance capability. Furthermore,
we develop a scheduling methodology to ensure the
satisfaction of message time constraints. This work
complements previous work on the design of real-time
communication networks [1, 6, 7, 8, 9, 12, 13, 14, 19,
21, 23, 24, 25, 26]. For a recent comprehensive survey,
the reader is referred to [17].

2  FBRN: An Enhanced FDDI Architecture

We have designed an enhanced network
architecture called FDDI-based reconfigurable network
(FBRN) [11]. An FBRN uses multiple FDDI trunk
rings to connect network stations. Specifically, n
stations are connected using r FDDI trunk rings.
Figure 1(a) shows an FBRN with four FDDI trunk rings
connecting four stations. Each station has certain
reconfiguration capabilities that provide an additional
level of fault-tolerance over that provided by the FDDI
wrap-up¹ operation. In the case of (multiple) link
faults, our FBRN automatically detects the
occurrence of faults and recovers by reconfiguring the trunk
ring connections. This is achieved by judiciously
reconnecting the fault-free segments of those trunk rings
that could not be recovered by the normal FDDI
wrap-up operation. Figure 1(c) demonstrates the
reconstruction of such a trunk ring.

¹An FDDI trunk ring can recover from a single trunk link
fault by wrapping up its dual counter-rotating loops [3].

Both analytical and simulation data show that our
FBRN can sustain a greater number of faults than
an ordinary FDDI network. For example, an
FBRN consisting of 20 nodes and 4 FDDI trunk rings
can still provide, on average, 200 Mbps of
transmission bandwidth even if the network has suffered more
than 10 link faults.

An FBRN can be implemented using existing FDDI
concentrator technology. Each node in the FBRN
network consists of an enhanced concentrator through
which the multiple FDDI trunk rings pass. The
concentrator has the capability to reconfigure the
connections (paths) among its various ports. A
configuration monitor, resident on each FBRN concentrator,
uses this ability to route packets among the fault-free
ring segments incident on the node. Further details
on FBRN implementation are discussed in [11].


(a): An FBRN with n=4, r=4.

Figure 1: FBRN Architecture


3  Scheduling  Methodology 

We develop a methodology for scheduling message
transmissions in an FBRN network so that message
time constraints are guaranteed. The methodology
consists of the following components:

•  Message assignment method. To provide
deadline guarantees in the presence of trunk link
faults, we have to exploit the multi-ring
architecture and the fault management mechanism.

Note that once an FDDI trunk ring is disabled
due to faults, messages can only be transmitted
on other rings before the faulty ring is recovered.
If all rings are fully utilized before the faults
occur, it is not possible to transfer message traffic
from a faulty ring to a non-faulty ring. Hence,
some of the messages will have to be dropped.
We assume that when the network is in a faulty
state (i.e., some of the FDDI trunk rings are
disabled), only a subset of messages that are critical
to the mission will be transmitted. The
deadlines of critical messages must be guaranteed at
all times, including during periods of fault
detection and recovery. The objective of this part of
the study is to properly assign critical and
non-critical messages to FDDI trunk rings so that the
deadlines of all messages are met when the
network is fault-free, and at least those of critical
messages are met when the network is faulty.

There are three possible methods for achieving
this objective:

-  Fully redundant assignment. A possible
solution is to transmit each critical message
over several rings so that when some of the
rings are not available, message deadlines
are still guaranteed because the messages are
transmitted via at least one available ring.
This solution is the simplest but it results in
wasted bandwidth when there is no fault.

-  Dynamic reassignment. Another solution
is to dynamically reallocate the messages
from one ring to another once a fault is
detected. This solution allows better
utilization of FDDI rings when there is no fault,
but cannot be applied to those applications
where the deadlines of critical messages are
too small to tolerate any delays caused by
fault detection and dynamic reallocation.

-  An integrated method. An alternative is to
combine the fully redundant method with
the dynamic reallocation method.
Critical messages with small deadlines are
assigned to several rings, and dynamic
reassignment is performed for other messages
when a link fault occurs on one ring. This
method achieves better utilization of the
network bandwidth than the fully redundant
method, while overcoming the shortcomings
of dynamic reallocation.

We are currently comparing and evaluating the
performance of these three approaches in terms of
their effectiveness in utilizing the network during
both normal and faulty conditions, and in terms
of their run-time overhead.

•  Bandwidth allocation method.

Once a message is assigned to a particular FDDI
trunk ring, its deadline cannot be automatically
guaranteed, even though FDDI has a property
of ‘bounded token rotation time.’ Guaranteeing
message deadlines is also dependent on the
appropriate allocation of the synchronous bandwidth
to the nodes. If the source node of a message is
allocated insufficient synchronous bandwidth, it
may be unable to complete the transmission of
real-time messages before their deadline. On the
other hand, allocating excess synchronous
bandwidths to the nodes could increase the token
rotation time, which may also cause message
deadlines to be missed.

Over the past two years, extensive studies have
been carried out on this subject. The first
comprehensive study on synchronous bandwidth
allocation for FDDI was reported in [1]. In [5], an
optimal bandwidth allocation algorithm was
studied. An optimal algorithm can guarantee
deadlines for all messages assigned to an FDDI ring
whenever there is an allocation method that can
do so. However, the optimal algorithm is
complicated and may not be feasible for on-line use.
Consequently, in [2], a localized bandwidth
allocation method was developed. A localized scheme
uses the information local to a node to allocate
its synchronous bandwidth. An advantage of
localized schemes is that a node can freely change
its message parameters and its synchronous
bandwidth (as long as the network utilization is within
the given bound) without disturbing the
operation of other nodes. In [10, 15, 16, 22], extensions
were made to the case where messages may have
arbitrary deadlines (i.e., the deadlines need not
be equal to the periods).

•  Bandwidth management.

In the above studies of synchronous bandwidth
allocation, it was shown that an FDDI trunk ring
where a node may have zero, one, or more streams
of synchronous messages can be transformed into
a logically equivalent network with one stream
per node. This assumption of one stream per
node was used to simplify the analysis without
loss of generality. However, in practice, having
multiple message streams on a node may cause
problems. In all the current implementations of
the FDDI MAC component, a FIFO queue is
used for out-going messages. Furthermore, the
FDDI MAC component cannot distinguish
between messages from different applications and
has no knowledge of the bandwidths allocated to
those applications. Whenever the token arrives at
a node, the MAC transmits messages in the
order of the FIFO queue until it exhausts its total
bandwidth.² As a result, if messages at the front
of the queue are long, messages in the later part
of the queue may not be transmitted on time, and
hence may miss their deadlines.

²The synchronous bandwidth allocated to a node is the sum
total of the bandwidths allocated to the applications executing
on that node.

In order to resolve this problem, we need to
manage the use of bandwidth at run-time. The
particular management functions are: 1) to fragment a
message into units of the size of the synchronous
bandwidth allocated to the application that
generated that message, and 2) to reorder these
fragments appropriately before transmission. This
reordering is done to arrange the fragments in a
sequence equivalent to the transmission sequence
that would result if each of these applications
were considered to be executing on a separate
node. As a result of this fragmentation and
reordering, each application is able to transmit a
portion of its messages every time the token is
received. This ensures that each application is
allowed to utilize the synchronous bandwidth
allocated to it.

In our current implementation, the bandwidth
management is accomplished by adding a
synchronous server (SS), which is a preprocessing
module at the application layer. The server
receives messages from different applications on the
node, fragments the messages, and reorders the
fragments as mentioned above, before forwarding
them to the device driver.
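
The fragmentation and reordering performed by the synchronous server can be sketched as follows. This is an illustrative simplification, not the implementation described above: the function names and data layout are hypothetical, allocations are assumed to be positive and expressed in the same units as message lengths, and each application forwards at most one fragment per token visit.

```python
from collections import deque

def fragment(message_len, alloc):
    """Split a message into fragments no longer than `alloc` (alloc > 0)."""
    sizes = []
    while message_len > 0:
        size = min(alloc, message_len)
        sizes.append(size)
        message_len -= size
    return sizes

def reorder(pending, alloc):
    """Return, for each token visit, the (app, fragment_size) pairs to queue.

    pending: dict app -> list of message lengths, in arrival order
    alloc:   dict app -> synchronous bandwidth allocated to that application
    """
    queues = {app: deque(s for m in msgs for s in fragment(m, alloc[app]))
              for app, msgs in pending.items()}
    visits = []
    while any(queues.values()):
        # One fragment per application per token visit, as if each
        # application were executing on a separate node.
        visits.append([(app, q.popleft()) for app, q in queues.items() if q])
    return visits
```

With `pending = {"A": [5], "B": [2]}` and `alloc = {"A": 2, "B": 1}`, application B's first fragment is queued for the first token visit instead of waiting behind all of A's message, which is the effect the FIFO queue alone cannot provide.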

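Returning to the message assignment component, the integrated method can be sketched as below. The paper does not give an algorithm, so everything here is an assumption made for illustration: the deadline threshold, the round-robin placement, the number of redundant copies (which should not exceed the number of rings), and the fact that ring capacity is ignored.

```python
def integrated_assignment(messages, rings, tight_deadline, copies=2):
    """Map each message name to the list of rings that carry it.

    messages: list of dicts with keys 'name', 'critical', 'deadline'
    """
    assignment = {}
    for i, m in enumerate(messages):
        if m["critical"] and m["deadline"] <= tight_deadline:
            # Too urgent to wait for fault detection: transmit redundantly.
            assignment[m["name"]] = [rings[(i + k) % len(rings)]
                                     for k in range(copies)]
        else:
            assignment[m["name"]] = [rings[i % len(rings)]]
    return assignment

def on_ring_fault(assignment, messages, failed, rings):
    """Dynamic reassignment after ring `failed` has been disabled."""
    critical = {m["name"]: m["critical"] for m in messages}
    alive = [r for r in rings if r != failed]
    result = {}
    for name, carried_on in assignment.items():
        remaining = [r for r in carried_on if r != failed]
        if remaining:
            result[name] = remaining      # unaffected, or a redundant copy survives
        elif critical[name] and alive:
            result[name] = [alive[0]]     # critical: move to a fault-free ring
        # Non-critical messages carried only on the failed ring are dropped.
    return result
```

Only the urgent critical messages pay the bandwidth cost of redundancy; the remaining critical messages are moved after a fault, and non-critical traffic on the failed ring is shed.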
4  Final  Remarks 

Finally, we would like to point out that the
architecture of our FBRN network is consistent with existing
FDDI hardware. Our scheduling method is
compatible with the FDDI/SAFENET standard. Hence, the
results obtained from our work will be immediately
applicable to the design and analysis of distributed
hard real-time systems where FDDI/SAFENET
networks are used. In fact, SAFENET has adopted our
bandwidth allocation method [16], while it is currently
considering our FBRN fault management method.

References 

[1]  G. Agrawal, B. Chen, W. Zhao, and S. Davari,
“Guaranteeing Synchronous Message Deadlines
with the Timed Token Protocol,” Proc. IEEE
International Conference on Distributed Computing
Systems, Yokohama, June 1992.

[2]  G. Agrawal, B. Chen, and W. Zhao, “Local
Synchronous Capacity Allocation Schemes for Hard
Real-Time Communications with the Timed
Token Media Access Control Protocol,” Proc. IEEE
INFOCOM’93, 1993.

[3]  FDDI Station Management Protocol (SMT),
ANSI Standard X3T9.5/84-89, X3T9/92-067,
Aug. 6, 1992.

[4]  FDDI Media Access Control (MAC), ANSI
Standard X3T9.5/88-139, Rev 4.0, Oct 29, 1990.

[5]  B. Chen, G. Agrawal, and W. Zhao,
“Optimal Synchronous Capacity Allocation for Hard
Real-Time Communications with the Timed
Token Media Access Control Protocol,” Proc. IEEE
Real-Time Syst. Symp., 1992.

[6]  D. Ferrari and D. C. Verma, “A Scheme for
Real-Time Channel Establishment in Wide-Area
Networks,” IEEE Journal on Selected Areas in
Communications, SAC-8:368-379, Apr. 1990.

[7]  D. T. Green and D. T. Marlow, “SAFENET - A
LAN for Navy Mission Critical Systems,” Proc.
Conf. on Local Computer Networks, Oct. 1989.

[8]  R. M. Grow, “A Timed Token Protocol for Local
Area Networks,” Proc. Electro/82, Token Access
Protocols, May 1982.

[9]  R. Jain, “Performance Analysis of FDDI Token
Ring Networks: Effect of Parameters and
Guidelines for setting TTRT,” IEEE LTS, May 1991.

[10]  S. Kamat, N. Malcolm, and W. Zhao, “The
Probability of Guaranteeing Synchronous Real-Time
Messages with Arbitrary Deadlines in an FDDI
Network,” Proc. IEEE Real-Time Syst. Symp.,
Dec. 1993.

[11]  S. Kamat, G. Agrawal, and W. Zhao, “On
Available Bandwidth in FDDI-Based Reconfigurable
Networks,” To appear in Proc. INFOCOM’94,
1994.

[12]  J. Lehoczky, L. Sha, and Y. Ding, “The Rate
Monotonic Scheduling Algorithm: Exact
Characterization and Average Case Behavior,” Proc.
IEEE Real-Time Syst. Symp., 1989.

[13]  C. C. Lim, L. Yao, and W. Zhao, “A Comparative
Study of Three Token Ring Protocols for
Real-Time Communications,” Proc. 11th IEEE
International Conf. on Distributed Computing
Systems, May 1991.

[14]  C. L. Liu and J. W. Layland, “Scheduling
Algorithms for Multiprogramming in a Hard-Real-Time
Environment,” J. ACM, 20(1):46-61, Jan.
1973.

[15]  N. Malcolm and W. Zhao, “Guaranteeing
Synchronous Messages with Arbitrary Deadline
Constraints in an FDDI Network,” Proc. 18th IEEE
Conf. on Local Computer Networks, 1993.

[16]  N. Malcolm and W. Zhao, “The Timed-Token
Protocol for Real-Time Communications,” IEEE
Computer, 27(1):35-41, Jan. 1994.

[17]  N. Malcolm and W. Zhao, “Hard Real-Time
Communication in Multiple-Access Networks,”
To appear in J. Real-Time Systems.

[18]  U.S. Department of Defense, Survivable
Adaptable Fiber Optic Embedded Network, Jan. 1994.
MIL-STD-2204A.


[19]  J. Ng and J. Liu, “Performance of Local Area
Network Protocols for Hard-Real-Time
Applications,” Proc. 11th IEEE International Conf. on
Distributed Computing Systems, May 1991.

[20]  K. C. Sevcik and M. J. Johnson, “Cycle Time
Properties of the FDDI Token Ring Protocol,”
IEEE Trans. Software Eng., 13(3), 1987.

[21]  L. Sha and S. S. Sathaye, “A Systematic
Approach to Designing Distributed Real-Time
Systems,” IEEE Computer, 26(9):68-78, Sept. 1993.

[22]  K. G. Shin and Q. Zheng, “Mixed
Time-Constrained and Non-Time-Constrained
Communications in Local Area Networks,” IEEE
Trans. on Communications, 44(11):1668-1676,
1992.

[23]  J. A. Stankovic and K. Ramamritham, Hard
Real-Time Systems, IEEE Press, 1988.

[24]  J. A. Stankovic and K. Ramamritham, Advances
in Real-Time Systems, IEEE Press, 1993.

[25]  A. M. van Tilborg and G. M. Koob, Foundations
of Real-Time Computing: Formal Specifications
and Methods, Kluwer Academic Publishers, 1991.

[26]  A. M. van Tilborg and G. M. Koob,
Foundations of Real-Time Computing: Scheduling and
Resource Management, Kluwer Academic
Publishers, 1991.

[27]  Q. Zheng and K. G. Shin, “Synchronous
Bandwidth Allocation in FDDI Networks,” Proc. ACM
First Conference on Multimedia, 1993.




Session  IV: 
Timing  Analysis 

Chair:  Stuart  Faulk 

SPC 


Correlation  Analysis  Techniques  for  Refining  Execution  Time 
Estimates  of  Real-Time  Applications  * 


Rajiv  Gupta 

Dept. of Computer Science
University  of  Pittsburgh 
Pittsburgh,  PA  15260 

Abstract 

Scheduling techniques based upon worst case
execution times, as are commonly used in real-time
applications, often result in severe underutilization of the
processor resources since most tasks finish in much
less time than their anticipated worst-case execution
times. In this paper we describe techniques for
identifying correlation among the executions of various
statements within a program. We demonstrate how
this information can be used to refine the estimate of
remaining worst case execution time of a real-time task
as the execution of the task progresses. Refined
estimates can be used at run-time to achieve better
utilization of the system and early failure detection and
recovery.

1  Introduction 

The success of real-time application software
depends on its ability to produce a functionally correct
result within definite timing constraints. Hard
real-time applications, especially embedded applications,
interact with and influence the environment in which
they execute. Consequently, safety and timeliness of
execution are critical issues since failures have the
potential to cause damage. A process in a real-time
application has timing-related constraints, such as an
Earliest-Start-Time (EST) and a Deadline (DL). It
must execute subject to these time constraints and
any failure to do so, a deadline failure for instance,
constitutes a failure.

The traditional approach to guaranteeing deadlines
is to obtain the worst case execution time estimates
(WETs) for the individual processes and manually lay
out the various execution timelines. While this
solution does ensure that the specific set of processes
meet their deadlines, there are several obvious
problems with such an approach. It does not scale
conveniently as the system gets larger [8]. It also often
results in severe under-utilization of the system since
tasks typically complete in much less time (sometimes
orders of magnitude less) than their WETs would
indicate [5]. Another problem with the above approach to
scheduling based on worst-case execution times arises
when a task for some reason, perhaps owing to
resource sharing delays, exceeds its WET. This results
in deadline failure, but such a failure is noticed very
late in the task’s lifetime. This reduces the time
available to take remedial action. One common solution to
this is to add a safety margin by increasing the WET
of a task by some arbitrary percentage. This approach
further exacerbates the underutilization problem.

In order to address the above problems, researchers
have proposed run-time refinement of execution time
estimates based upon monitoring information [4] [5]
[7]. We previously introduced a technique called
Compiler Assisted Adaptive Scheduling (CAADS) [4], in
which execution times of the various parts of the
program are determined at run-time. If time savings are
observed at run-time, they can be used to
accommodate newly arriving tasks. On the other hand, if the
estimate of the remaining worst case execution time
(RWET) of the task is too high, then the decision to
abort an executing task is made.
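
The kind of run-time decision described above can be sketched as follows. This is not the CAADS algorithm from [4]; it is a minimal illustration that assumes "too high" means the RWET no longer fits before the deadline, that the task runs without further interference, and that all times share one unit. The function name is hypothetical.

```python
def check_task(now, deadline, rwet):
    """Decide what to do at a monitoring point inside a running task.

    Returns ("abort", 0) if the remaining worst-case execution time cannot
    fit before the deadline, otherwise ("continue", slack), where slack is
    the time that could be offered to newly arriving tasks.
    """
    slack = deadline - (now + rwet)
    if slack < 0:
        return ("abort", 0)
    return ("continue", slack)
```

A tighter RWET estimate helps in both directions: it exposes more slack when the task is ahead of schedule, and it avoids aborting a task that would in fact have met its deadline.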

The RWET values will often be much higher than
actual remaining execution times. In this paper we
present techniques for identifying parts of a program
whose executions are correlated with one another. At
any given point during the execution of a task, based
upon the execution path followed so far and the
correlation information, the execution paths that can be
followed during the remainder of the task are predicted.
This information is then used to estimate the worst
case execution time of the remainder of the task, to
perform such adaptations as may be required to ensure
that deadlines are met, to pre-schedule other related
processes, and to pre-allocate resources for this and
other processes.

Evidence that a significant degree of correlation
exists in programs has recently been provided by Pan et
al. [6]. They found that correlation information
significantly improved the accuracy of branch prediction.
Their results show that, as compared to a 2-bit
counter-based prediction scheme, correlation-based branch
prediction achieved 11% additional accuracy. In this
paper we propose to utilize correlation information for
improving RWET estimates. In order to do so we must
identify correlation present in a program. Correlation
can be detected at compile-time through static
analysis techniques and at run-time using profiling
techniques. Surprisingly, many of the opportunities for
correlation identified by Pan et al. [6] using run-time
history can also be identified through compile-time
analysis of a program. However, Pan et al. [6] did
not develop any techniques for identifying correlation
through program analysis. In this paper we develop
such techniques.

In section 2 we identify the type of correlation
information that is useful for refining RWETs and we
briefly illustrate how correlation information is used to
carry out the refinement. In section 3 we consider an
important class of correlations and illustrate the
problems that a compiler faces in recognizing these
correlations. We briefly outline our approach for
compile-time analysis for detecting these correlations. Due to
space limitations we will not discuss profiling
techniques for correlation detection. Although it is obvious
that correlation can be detected through profiling, how
to do so efficiently is a challenging problem. Therefore
in this respect our efforts are focussed on introducing
minimal program instrumentation necessary to collect
profile data.

2  Exploiting  Correlation  Information 
for  Refining  RWETs 

Intuitively, a correlation between two events, $E_1$
and $E_2$, exists if the outcome of $E_1$ determines the
outcome of $E_2$ under some execution. Thus, correlation
between $E_1$ and $E_2$ can only exist if event $E_2$ can
occur after event $E_1$ has taken place. Although it is
possible to determine correlation between arbitrary events in a
computation, we concentrate on events that affect the
execution time of a computation. This approach
applies to both sequential and parallel programs. Some
examples of events for which correlation values may
be useful are as follows:

•  The correlation between execution of a statement
S (or a set of statements), and the true/false
evaluation of a branch B in a sequential program,
denoted as $S \rightarrow B^{T/F}$.

•  The correlation between execution of a statement
S outside a loop L and the number of loop
iterations, denoted as $S \rightarrow Iter(L)$.

•  The correlation between the call site of procedure
P and the true/false evaluation of a branch B in
P, denoted as $CallSite(P) \rightarrow B^{T/F}$.

•  The correlation between execution of a statement
S and creation of a task T in a parallel program,
denoted as $S \rightarrow Create(T)$.

We consider the first kind of correlations listed
above in the remainder of this paper. However, the
broad principles that are used to handle the first kind
of correlations are also applicable to other kinds of
correlations. The code fragment shown in Figure 1
illustrates the utility of correlation information in
estimating RWETs. Each statement in the figure is
labeled as S#(time), where time is the execution time
of the statement. There is a correlation between the
execution of statement S1 and the false evaluation
of the branch B0 nested inside the for loop, that is,
$S1 \rightarrow B0^{F}$. The RWET immediately preceding the
for loop must be considered as 400 time units if no
correlation information is available, since in this
situation we must assume that S4 is executed during each
of the 100 iterations of the loop (S2, S3, B0 and S4 at
one time unit each). However, once having recognized
that there is a correlation between the execution of S1
and the false evaluation of B0, we can obtain a
better RWET estimate. By recognizing that S1 has been
executed at run-time, we can consider the RWET
preceding the for loop to be 300 time units instead of 400
time units.

S0(1): if (..) { S1(1): a = 0 }
S2(1): for ( i = 1; i != 101; i++ ) {
S3(1):     ...
B0(1):     if (a != 0) {
S4(1):         ...
           }
       }

Figure 1. Estimating RWET using Correlation.
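
The arithmetic behind the 400 and 300 time unit estimates can be checked with a short sketch. This is only an illustration of the Figure 1 example, not the authors' implementation; the names `COST`, `ITERATIONS` and `rwet_before_loop` are hypothetical, and the sketch assumes, as the text does, that the loop control S2 costs one time unit per iteration.

```python
# Statement costs taken from Figure 1 (one time unit each).
COST = {"S2": 1, "S3": 1, "B0": 1, "S4": 1}
ITERATIONS = 100  # i runs from 1 to 100


def rwet_before_loop(s1_executed: bool) -> int:
    """Remaining worst case time of the for loop, in time units."""
    per_iteration = COST["S2"] + COST["S3"] + COST["B0"]
    if not s1_executed:
        # No correlation can be applied: assume B0 is true, so S4 runs.
        per_iteration += COST["S4"]
    return ITERATIONS * per_iteration


print(rwet_before_loop(False))  # 400
print(rwet_before_loop(True))   # 300
```

When S1 is known to have executed, the correlation rules out S4 in every iteration, which removes 100 time units from the estimate.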

The  above  example  illustrates  a  common  situation 
that Pan et al. [6] observed in many SPEC integer
benchmark  programs  such  as  the  gnu  C-compiler, 
eqntott,  li  etc.  Very  often  there  are  statements  that 
assign  constant  values  to  variables  that  are  typically 
flags.  Later  in  the  program  the  flag  is  checked  to  de¬ 
termine  whether  or  not  a  body  of  code  should  be  ex¬ 
ecuted. 
3  Computing Correlation Information

In  this  section  we  describe  various  situations  in 
which  there  is  a  correlation  between  the  execution  of 
an  assignment  statement  and  the  true/false  evaluation 
of  a  branch  encountered  later  in  the  program.  With 
each  different  situation  the  compiler  is  faced  with  dif¬ 
ferent  challenges.  The  solutions  to  these  challenges 
are  integrated  to  develop  an  algorithm  which  is  also 
presented  in  this  section. 

The basic approach taken by the compiler is to first
identify the conditional branches whose outcome
significantly affects the program execution time. Next, in
order to predict each such branch, the compiler
examines the assignment statements that directly or
indirectly affect the values of variables referenced by the
branch condition. In other words, given a branch B,
we must search for a statement S such that
$S \rightarrow B^{T/F}$ exists.

1. Directly Affecting Constant or Non-constant
Assignment: If S assigns a value to a
variable which is compared with a constant or
another variable to determine the outcome of B, and
the outcome can be determined at compile time,
then we have identified the correlation $S \rightarrow B^{T/F}$.
The example in Figure 1 illustrated this situation.
Application of symbolic evaluation can expose
correlations that are not as readily observable
as the case in Figure 1. For example, in Figure 2a
the correlation $S2 \rightarrow B0^{T}$ requires
symbolic evaluation of the branch condition.

S0: ...
S1: if (..) { S2: y = a + 1 }
B0: if (y != a) { S4: ... }

Figure 2a. Correlation Detection Requiring
Symbolic Evaluation.

B0: while (not done) {
S1:     ..
S2:     if (..) { S3: done = true }
S4:     ..
    }

Figure 2b. Correlation Indicating Loop
Termination.

In the example shown in Figure 2b the correlation
$S3 \rightarrow B0^{F}$ can be detected. The correlation
essentially identifies the loop termination condition.
The worst case execution time of the while loop
can be computed accurately using this information.
To compute the worst case time we should
assume that the if-condition in S2 evaluates to
false in all iterations but the last. If correlation
information were not available, we would have
assumed true evaluation for the if-condition during
all loop iterations when computing the worst case
execution time.

2. Directly Affecting Multiple Assignments:
In some situations the branch condition may
reference multiple variables and hence the correlation
relationship may involve more than one assignment
statement. The example in Figure 3, taken
from the SPEC benchmark eqntott, illustrates this
situation. The branch B0 evaluates to false if
statements S1 and S3 are executed, that is,
$S1 \wedge S3 \rightarrow B0^{F}$. In this situation the
compiler must find all definitions of each variable used
in a branch that reach the branch. Next a
combination of assignment statements corresponding
to the variables is selected such that the
corresponding values of variables enable the evaluation
of the branch and the execution of these
statements is not mutually exclusive. Each such
selection provides us with a correlation that allows
the prediction of branch outcome under certain
conditions.

S0: if (aa == 2) { S1: aa = 0 }
S2: if (bb == 2) { S3: bb = 0 }
B0: if (aa != bb) { S5: ... }

Figure 3. Correlation Involving Multiple
Assignment Statements.


3. Indirectly Affecting Assignments: In some
situations in order to identify correlations we need
to compute a program data slice [9] which identifies
all those statements that directly or indirectly
influence the values of variables used in a branch
condition. Consider the example shown in Figure 4.
The program slice for the value of y in B0
includes statements S1, S2 and S4. By systematically
searching this slice we can determine that
the execution of S1 followed by the execution of
S4 will cause branch B0 to evaluate to true, that
is, $S1 \wedge S4 \rightarrow B0^{T}$. As we can see, this situation
cannot be handled by the earlier techniques discussed
in this section.

S0: if (..) { S1: x = 1 }
    else { S2: x = a }
S3: if (..) { S4: y = x }
B0: if (y == 1) { S5: ... }

Figure 4. Indirect Correlation Requiring
Program Slicing.
The above example illustrates that in order to detect
all correlations we must consider both statements
that  directly  influence  the  branch  variables  and  those 
that  indirectly  influence  them.  By  combining  the  ideas 
of  symbolic  evaluation,  multiple  assignments  and  di¬ 
rect/indirect  influences  we  develop  a  general  algorithm 
for  identifying  correlations.  The  main  steps  of  the  al¬ 
gorithm  are  as  follows: 

1. A control flow graph (CFG) representation of the
program is constructed [1].

2. The program is converted to static single assignment
(SSA) form [2]. The conversion of a program
into this form guarantees that each use of a variable
is reached by a single definition of that variable,
that is, the definition-use relationships are
explicit in the program. This
representation simplifies the algorithms used in
later steps. To achieve this goal a φ function
corresponding to a variable is introduced at a
join point if different definitions of the variable
reach the join point along different paths. The φ
function indicates the selection of the appropriate
value of the variable based upon the path along
which the control reaches the join point.

3.  Using  some  simple  heuristics,  the  branch  condi¬ 
tions  whose  results  are  likely  to  affect  the  execu¬ 
tion  time  of  a  program  significantly  are  identified. 

4. For each branch condition, B, identified in the
preceding  step,  we  identify  the  statements  that 
influence  its  outcome  as  follows. 

• For each variable v, used in branch B, we compute
the corresponding data slice, DS(v). The
data slice contains all statements that directly or
indirectly lead to the computation of the value of
v used in the branch condition. Thus, the data
slice is computed by taking a closure of data
dependences for the variable [9].

• Corresponding to each variable v, using the data
slice DS(v) and forward substituting expressions,
we generate a set of symbolic expressions, SE(v),
that represent the value of variable v at B under
various program executions. The set of control
conditions that must hold for a given expression
$e \in SE(v)$ is denoted as CC(v,e). Depending
upon the complexity of SE and CC sets that one
is willing to allow, we can devise algorithms for
computing SE and CC sets with varying degrees
of complexity. We intend to include only those
expressions in SE for which the corresponding
control condition in CC includes a single predicate.
The reason for this restriction is that it limits the
run-time overhead associated with capturing control
conditions at run-time. Special consideration of
loops is required during the generation of SE sets.
Expressions from loops are only forward substituted
if they represent invariant computations. This
condition ensures that the presence of loops does
not generate an unbounded number of expressions in
the SE sets.

• For each combination of expressions for the
branch variables, taken from their respective SE
sets, we attempt an evaluation of the branch
condition B using symbolic analysis. Based upon
these evaluations we prepare a set of combinations
EVAL(B) for which the branch was successfully
evaluated using symbolic analysis.

•  For  each  possible  combination  of  variable  expres¬ 
sions  in  EVAL(B),  identified  in  the  preceding 
step,  we  check  the  control  conditions  (CCs)  un¬ 
der  which  the  expressions  hold  to  ensure  that  the 
branch  variables  can  simultaneously  hold  these 
values  in  some  program  execution.  If  the  con¬ 
trol  conditions  can  hold  simultaneously,  we  have 
identified  a  possible  correlation  under  which  the 
outcome  of  the  branch  condition  B  can  be  pre¬ 
dicted. 

The  above  discussion  provided  the  main  steps  for 
correlation  detection.  The  details  of  the  various  steps 
are omitted due to space limitations. In Figure 5 we
present an example to briefly illustrate the above
algorithm. The SSA form of the code fragment in Figure 5a
is given in Figure 5b. Due to the renaming of variables
and the introduction of φ functions, the data flow
relationships have been made explicit. The SE and CC
sets  for  the  three  branches  are  given  in  Figure  5c.  Us¬ 
ing  the  SE  values  we  evaluate  the  branches  and  get 
the  true/false  evaluations  of  branches  in  the  instances 
shown  in  Figure  5d.  After  checking  the  feasibility  of 
these  evaluations  we  detect  the  correlations  listed  in 
Figure  5e. 


S0: read y
S1: inc = 1
B0: if (y > 0) { S2: z = 1;
                 S3: w = 2; S4: x = y - inc }
    else { S5: z = 2; S6: w = 1; S7: x = y + inc }
B1: if (x < y) { ... }
B2: if (w == z) { ... }

Figure 5a. Example Program.
S0: read y
S1: inc = 1
B0: if (y > 0) { S2: z1 = 1;
                 S3: w1 = 2; S4: x1 = y - inc }
    else { S5: z2 = 2; S6: w2 = 1; S7: x2 = y + inc }
z3 = φ(z1, z2)
w3 = φ(w1, w2)
x3 = φ(x1, x2)
B1: if (x3 < y) { ... }
B2: if (w3 == z3) { ... }

Figure 5b. SSA form of the Program.

B0:
SE(y) = {y}

B1:
SE(x3) = {y-1, y+1};
CC(x3,y-1) = { y>0 }; CC(x3,y+1) = { ¬y>0 }
SE(y) = {y}; CC(y,y) = {true}

B2:
SE(w3) = {2,1};
CC(w3,2) = { y>0 }; CC(w3,1) = { ¬y>0 }
SE(z3) = {1,2};
CC(z3,1) = { y>0 }; CC(z3,2) = { ¬y>0 }

Figure 5c. The SE and CC Sets.

B0: no evaluation achieved.

B1:
x3=y-1 => B1^T
x3=y+1 => B1^F

B2:
(w3,z3)=(2,1) => B2^F; (w3,z3)=(2,2) => B2^T
(w3,z3)=(1,1) => B2^T; (w3,z3)=(1,2) => B2^F

Figure 5d. Branch Evaluation.

B0: no correlation detected.

B1: S4 -> B1^T; S7 -> B1^F

B2: S2∧S3 -> B2^F; S5∧S6 -> B2^F

Figure 5e. Correlation Detected.
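
The last two steps of the algorithm (branch evaluation and the feasibility check on control conditions) can be sketched for the Figure 5 example as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the SE and CC sets are written out by hand rather than derived from a data slice, control conditions are compared as plain strings, and all names (`SE_x3`, `correlations_b1`, and so on) are hypothetical.

```python
from itertools import product

# Each entry: (value, control condition, statements producing the value).
# For x3 the value k stands for the symbolic expression y + k.
SE_x3 = [(-1, "y>0", ("S4",)), (+1, "not y>0", ("S7",))]
SE_w3 = [(2, "y>0", ("S3",)), (1, "not y>0", ("S6",))]
SE_z3 = [(1, "y>0", ("S2",)), (2, "not y>0", ("S5",))]


def correlations_b1():
    """B1 is x3 < y, i.e. (y + k) < y, which holds exactly when k < 0."""
    return {stmts: (k < 0) for k, _cc, stmts in SE_x3}


def correlations_b2():
    """B2 is w3 == z3; keep only combinations whose control conditions agree."""
    found = {}
    for (w, cc_w, s_w), (z, cc_z, s_z) in product(SE_w3, SE_z3):
        if cc_w != cc_z:
            # The two values cannot hold in the same execution.
            continue
        found[tuple(sorted(s_w + s_z))] = (w == z)
    return found


print(correlations_b1())  # {('S4',): True, ('S7',): False}
print(correlations_b2())  # {('S2', 'S3'): False, ('S5', 'S6'): False}
```

Of the four combinations evaluated for B2 in Figure 5d, only the two with matching control conditions survive, which is why Figure 5e lists no correlation leading to a true evaluation of B2.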


S0: y = 1
B0: while (..) {
B1:     if (..) { S1: x = y + 1 }
S2:     ..
    }
B2: if (x < 3) { .. }

Figure 6. Forward Substitution in Loops.

The example in Figure 6 illustrates the detection of
correlation in the presence of a loop. If statement S1 is
loop invariant and there are no other definitions of x in
the loop, then we generate an expression for the value
of x computed at S1. This results in the detection
of the correlation $S1 \rightarrow B2^{T}$. In all other situations the
generation of expressions through the loop would have
been discontinued and no correlation for B2 would
have been detected.

4  Concluding  Remarks 

In  this  paper  we  have  introduced  compiler  tech¬ 
niques  for  detecting  and  exploiting  correlation  infor¬ 
mation  for  refining  RWETs.  We  did  not  consider 
parallel  programs  in  this  paper.  However,  methods 
for  statically  analyzing  distributed  programs  including 
slicing  algorithms  also  exist  [3].  Thus  we  believe  that 
with  further  research  our  approach  can  be  extended 
to  handle  distributed  programs. 
Abstract

This  paper  describes  a  timing  tool  being  developed 
by  a  real-time  research  group  at  Seoul  National  Uni¬ 
versity.  Our  focus  is  on  the  issues  resulting  from 
advanced architectural features such as pipelined
execution and cache memories found in many modern
RISC-style processors. For each architectural feature
we  state  the  issues  and  explain  our  approach. 


1  Introduction 

In  real-time  computing  systems,  tasks  have  tim¬ 
ing  requirements  (i.e.,  deadlines)  that  must  be  met 
for  correct  operation.  Various  scheduling  techniques 
have  been  proposed  to  guarantee  such  timing  require¬ 
ments.  In  many  cases,  these  scheduling  techniques 
require  that  the  worst  case  execution  times  (WCETs) 
of  tasks  be  known  a  priori. 

This  paper  describes  a  timing  tool  that  is  being  de¬ 
veloped  by  a  real-time  research  group  at  Seoul  Na¬ 
tional  University.  This  timing  tool  aims  at  accu¬ 
rately  calculating  guaranteed  worst  case  execution 
times  of  programs  for  computer  systems  that  use  mod¬ 
ern  RISC-style  microprocessors.  Our  particular  focus 
is  on  how  the  timing  tool  addresses  the  issues  resulting 
from  advanced  features  of  these  microprocessors  such 
as  pipelined  execution  and  cache  memory.  There  have 
been  various  approaches  to  predicting  program  execu¬ 
tion  times  [2,  5,  8,  9,  11,  12].  However,  their  machine 
models were mostly CISC-style processors rather than
RISC-style  microprocessors. 

‘This  work  was  supported  in  part  by  ADD  (Contract  ADD- 
91-4-4)  and  KOSEF  (Grant  KOSEF-93-01-00-10). 


Dept. of Computer Engineering
Chung-Ang  University 
Seoul  156-756,  Korea 

This  paper  is  organized  as  follows.  Section  2 
presents  the  overview  of  the  timing  tool.  In  section  3, 
we  explain  the  problems  in  accurately  estimating  the 
WCETs  of  tasks  in  pipelined  processors  and  present 
an  analysis  method  based  on  extended  timing  schema. 
Section  4  explains  why  an  accurate  timing  analysis  is 
difficult  in  computer  systems  with  cache  memories  and 
briefly  discusses  our  approach.  Finally,  we  conclude 
this  paper  in  section  5. 


2  Overview  of  the  timing  tool 

Our timing tool, like [7, 8], is based on the timing
schema [10]. The timing schema is a set of formulas
for computing the time-bounds of programming
constructs. For example, the time-bound of S: if (exp)
then S1; else S2 is computed by the following
equations:

\[ T(S) = T(S_{then}) \uplus T(S_{else}) \]
\[ T(S_{then}) = T(exp) + T(then) + T(S_1) \]
\[ T(S_{else}) = T(exp) + T(else) + T(S_2) \]

where $T(exp)$, $T(S_1)$ and $T(S_2)$ are the time-bounds of
exp, S1, and S2, respectively, and $T(then)$ and $T(else)$
are the time-bounds to transfer control to S1 and S2,
respectively. The operation $\uplus$ on time-bounds is
defined as

\[ [a, b] \uplus [c, d] = [\min(a, c), \max(b, d)] \]
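
The if-statement schema above can be sketched in a few lines. This is an illustrative sketch rather than the tool's code: time-bounds are modelled as (lower, upper) pairs, addition of time-bounds is assumed to be component-wise, and the function names and the numeric bounds in the usage example are hypothetical.

```python
from typing import Tuple

Bound = Tuple[int, int]  # (best case, worst case)


def plus(*bounds: Bound) -> Bound:
    """Sequential composition: add lower and upper bounds separately."""
    return (sum(b[0] for b in bounds), sum(b[1] for b in bounds))


def union(a: Bound, b: Bound) -> Bound:
    """The operation on alternatives: [min of lowers, max of uppers]."""
    return (min(a[0], b[0]), max(a[1], b[1]))


def if_bound(t_exp: Bound, t_then: Bound, t_s1: Bound,
             t_else: Bound, t_s2: Bound) -> Bound:
    """Time-bound of: if (exp) then S1; else S2."""
    return union(plus(t_exp, t_then, t_s1), plus(t_exp, t_else, t_s2))


# Made-up bounds: the then path gives [7, 14], the else path [9, 12].
print(if_bound((2, 3), (1, 1), (4, 10), (1, 2), (6, 7)))  # (7, 14)
```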

Our  timing  tool  consists  of  a  compiler  and  a  tim¬ 
ing  analyzer.  The  compiler  is  a  modified  version  of 


0-8186-5710-3/94  $3.00  @  1994  IEEE 




an ANSI C compiler called lcc [1]. This compiler
accepts a C source program and generates the
assembly code along with program structure information.
Figure  1  shows  a  sample  C  program  and  the  gener¬ 
ated  assembly  code.  Also  shown  in  the  figure  is  the 
program  structure  information  in  both  textual  and 
graphical  forms.  The  timing  analyzer  uses  the  assem¬ 
bly  code  and  the  program  structure  information  along 
with user-provided information (e.g., iteration counts
of  loop  statements,  WCETs  of  the  library  functions 
used  in  the  program)  to  compute  the  time-bound  of 
the  program.  The  machine  model  currently  supported 
in  the  timing  tool  is  the  MIPS  R3000  CPU. 


3  Pipelining  effects 

Due to data dependencies and resource conflicts
within the execution pipeline, the execution time of a
basic block will differ depending on which basic block
among the possible basic blocks was executed prior to
this basic block. In the original timing schema, it is
difficult to accurately account for such timing
variations since the basic object of the timing schema is a
simple time-bound. To rectify this problem, we
extended the original timing schema. In the extended
timing schema, the basic object is a reservation table
rather than a simple time-bound. The reservation
table was originally proposed to describe and analyze
activities within a pipeline [3] (cf. Figure 2). In the
reservation table, the vertical dimension represents the
stages in the pipeline and the horizontal dimension
represents time. The shaded boxes in the reservation
table specify the use of the corresponding stages for
the indicated period of time. In our approach,
reservation tables are used to specify timing of instruction
executions. In our timing tool, associated with each
reservation table are its worst and best case execution
times that are denoted by $t_{max}$ and $t_{min}$ respectively
in Figure 2.

Figure 2: Typical reservation table

The use of reservation tables as the basic objects
of the timing schema allows us to rewrite the timing
schema in such a way that it takes into account the
timing variation due to dependencies between
programming constructs. For example, in the extended timing
schema, the timing schema of S: S1; S2 is

\[ R(S) = \{\, r \mid r = r_1 \oplus r_2,\ r_1 \in R(S_1),\ r_2 \in R(S_2) \,\} \]

where $\oplus$ is an operation that concatenates two
reservation tables giving another reservation table. $R(S_1)$
is the set of reservation tables corresponding to the
set of the execution paths that might take the longest
time among the possible execution paths in S1. $R(S_2)$
is defined similarly. During each instantiation of the
above timing schema, a check is made to see whether
the resulting set of reservation tables can be pruned. A
reservation table can be pruned if the worst case
execution time of the reservation table is shorter than the
best case execution time of another reservation table
in the same set. Note that in such a case the execution
path corresponding to a pruned reservation table
cannot be the worst case execution path. Figure 3 shows
an example of such pruning.
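
The pruning rule can be sketched as follows. This is a simplified illustration, not the tool's implementation: a reservation table is reduced to its best and worst case execution times, the concatenation operation is not modelled, and the names `Table` and `prune` as well as the sample values are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Table:
    name: str
    t_min: int  # best case execution time
    t_max: int  # worst case execution time


def prune(tables: List[Table]) -> List[Table]:
    """Drop every table whose worst case is shorter than the best case
    of some other table in the same set."""
    if not tables:
        return []
    # A table is dominated exactly when its t_max is below the largest
    # t_min in the set; the table holding that t_min is never dominated.
    largest_best_case = max(t.t_min for t in tables)
    return [t for t in tables if t.t_max >= largest_best_case]


paths = [Table("A", 10, 14), Table("B", 12, 20), Table("C", 5, 11)]
print([t.name for t in prune(paths)])  # ['A', 'B']
```

Table C is dropped because its worst case (11) is shorter than the best case of B (12), so the path it represents cannot be the worst case execution path.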

Likewise, the timing schema of an if statement S:
if (exp) then S1; else S2 is

\[ R(S) = \{\, r_a \mid r_a = r_{exp} \oplus r_1,\ r_{exp} \in R(exp),\ r_1 \in R(S_1) \,\} \cup \{\, r_b \mid r_b = r_{exp} \oplus r_2,\ r_{exp} \in R(exp),\ r_2 \in R(S_2) \,\} \]

where $\cup$ is the set union operation. As in the previous
example, pruning is performed each time a new set of
reservation tables is derived.

In the current implementation, not all the columns
in the reservation table are stored. Instead, only the first
few columns, whose timing behavior may be affected
by the preceding basic block, and the last few columns,
which may affect the timing behavior of the succeeding
basic block, are maintained.

4  Cache  Memories 

Our timing tool currently does not support cache
memories that have been extensively used to bridge
the speed gap between the processor and main
memory. In a cache-based computer system, we need to
know whether each memory reference hits or misses in
the cache to locate the worst case execution path.
Unfortunately such information is known only after the
worst case execution path has been found, due to the
history-sensitive nature of caches. This cyclic dependency,
in many cases, yields a pessimistic estimation of WCETs [6].
To rectify the problem resulting from the cyclic
dependency, we again extended the timing schema [4].
We will briefly describe the technique in the
following.

Figure 3: Example of pruning

In  the  proposed  technique,  associated  with  each 
statement  is  a  set  of  atomic  objects,  each  of  which 
abstracts  an  execution  path  in  the  statement.  Each 
atomic  object  consists  of  two  sets  of  memory  block 
addresses  and  an  execution  time  estimate.  The  first 
set  maintains  the  memory  block  addresses  of  the  ref¬ 
erences  whose  hits  or  misses  depend  on  the  cache  con¬ 
tents  before  the  statement.  In  other  words,  this  set 
maintains  for  each  cache  block  the  memory  block  ad¬ 
dress  of  the  first  reference  to  the  cache  block.  The  sec¬ 
ond  set  maintains  the  addresses  of  the  memory  blocks 
that will remain in the cache after the execution of
the statement. In other words, this set maintains for
each  cache  block  the  memory  block  address  of  the  last 
reference  to  the  cache  block.  This  is  the  cache  con¬ 
tents  that  will  determine  the  hits  or  misses  of  memory 
references  from  succeeding  statements.  The  execution 
time  estimate  is  an  estimated  time  needed  to  execute 
the  statement.  In  this  estimate,  correctly  accounted 
for  are  the  “guaranteed”  hits  and  misses.  However,  the 
memory  references  whose  hits  or  misses  are  not  known 
(i.e.,  those  in  the  first  set)  are  conservatively  assumed 
to  miss  in  the  cache  in  the  initial  estimate.  This  ini¬ 
tial  estimate  is  later  refined  as  the  hits  or  misses  of 
those  references  are  known  in  a  later  stage  of  analy¬ 
sis.  This  framework  allows  us  to  rewrite  the  timing 
schema  so  as  to  accurately  analyze  the  timing  behav¬ 
ior  of  cache  memories.  A  detailed  discussion  of  the 
resultant  timing  schema  is  beyond  the  scope  of  this 
paper  and  interested  readers  are  referred  to  [4].  How¬ 
ever,  it  is  worth  mentioning  that  the  resulting  timing 
schema  is  very  similar  to  the  one  given  in  the  previous 
section  and  that  the  two  timing  schemas  can  easily  be 
combined. 
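
The atomic object described in this section can be sketched for a single execution path as follows. This is only an illustration under explicit assumptions, not the technique of [4] itself: the cache is assumed to be direct-mapped with a modulo mapping from memory blocks to cache blocks, the two sets are stored as dictionaries keyed by cache block, and the hit and miss times as well as all names are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class AtomicObject:
    first_refs: Dict[int, int]  # cache block -> memory block referenced first
    last_refs: Dict[int, int]   # cache block -> memory block referenced last
    time: int                   # initial (conservative) execution time


def summarize(path: List[int], n_cache_blocks: int,
              hit_time: int = 1, miss_time: int = 10) -> AtomicObject:
    """Abstract one execution path, given as a list of memory block addresses."""
    first: Dict[int, int] = {}
    last: Dict[int, int] = {}
    time = 0
    for mem_block in path:
        cache_block = mem_block % n_cache_blocks  # assumed direct mapping
        if cache_block not in first:
            first[cache_block] = mem_block
            time += miss_time          # outcome unknown: assume a miss
        elif last[cache_block] == mem_block:
            time += hit_time           # guaranteed hit
        else:
            time += miss_time          # guaranteed miss
        last[cache_block] = mem_block
    return AtomicObject(first, last, time)


def refine(obj: AtomicObject, cache_before: Dict[int, int],
           hit_time: int = 1, miss_time: int = 10) -> int:
    """Refine the estimate once the cache contents before the statement are known."""
    hits = sum(1 for cb, mb in obj.first_refs.items()
               if cache_before.get(cb) == mb)
    return obj.time - hits * (miss_time - hit_time)


obj = summarize([0, 4, 0, 1, 1], n_cache_blocks=4)
print(obj.time)                # 41
print(refine(obj, {0: 0}))     # 32
```

The `last_refs` of one statement play the role of `cache_before` for the statement that follows it, which is how the initial estimate is refined in a later stage of the analysis.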

5  Conclusion 

In  this  paper,  we  explained  the  difficulty  of  ac¬ 
curately  estimating  the  time-bounds  of  programs  in 
RISC-based  computer  systems  by  using  as  an  exam¬ 
ple  the  timing  tool  we  are  currently  developing.  Our 
particular  focus  was  on  pipelined  execution  and  cache 
memories  that  are  typical  of  RISC-based  computer 
systems. 

On  the  pipelined  execution,  we  explained  the  limi¬ 
tation  of  the  original  timing  schema  and  described  an 
extension of it that is based on reservation tables. On
the  memory  hierarchy,  we  explained  why  an  accurate 
timing  analysis  is  difficult  in  computer  systems  with 
cache  memories  and  described  our  approach. 

We  expect  that  an  accurate  analysis  of  combined 
effects  of  pipelined  execution  and  cache  memory  on 
program  execution  time  will  be  possible  when  the 
planned  extension  of  our  timing  tool  is  completed. 
This  will  allow  RISC-style  processors  to  be  widely  used 
in  real-time  systems  without  worrying  about  their  un¬ 
predictable  worst  case  performance. 
Abstract

This paper illustrates a formal technique for
describing the timing properties and resource constraints
of pipelined superscalar processor instructions at a high
level. Superscalar processors can issue and execute
multiple instructions simultaneously. The degree of
parallelism depends on the multiplicity of hardware
functional units as well as data dependencies among
instructions. Thus, the timing properties of a
superscalar program are difficult to analyze and predict.

We  describe  how  to  model  the  instruction-level  ar¬ 
chitecture  of  a  superscalar  processor  using  ACSR  and 
how  to  derive  the  temporal  behavior  of  an  assembly 
program  using  the  ACSR  laws.  The  salient  aspect 
of  ACSR  is  that  the  notions  of  time,  resources  and 
priorities  are  supported  directly  in  the  algebra.  Our 
approach  is  to  model  superscalar  processor  registers 
as  ACSR  resources,  instructions  as  ACSR  processes, 
and  use  ACSR  priorities  to  achieve  maximum  possible 
instruction-level  parallelism. 

1  Introduction 

Instruction-level  parallelism  is  widely  used  in  super¬ 
scalar  processors  to  improve  execution  speed.  Super¬ 
scalar  processors  realize  instruction-level  parallelism 
by  replicating  functional  hardware  and  by  overlapping 
instruction execution stages in a pipeline [7, 5, 8].
Consequently, multiple instructions can be issued and
executed simultaneously in superscalar processors. The
degree of parallelism depends on the multiplicity of
hardware functional units as well as data
dependencies among instructions. One of the difficulties in using
superscalar processors for time critical applications is
that  it  is  difficult  to  predict  the  timing  behavior  of 
programs. 

Our goal is to augment the Instruction Set
Architecture (ISA) level [3] description with timing
properties and resource constraints using a formal technique
based on process algebra. We use ACSR, Algebra of
Communicating Shared Resources, because it includes
the  notions  of  time,  resources  and  priorities.  There  are 
several  advantages  for  using  ACSR.  First,  the  notion 
of  ACSR  resource  facilitates  the  modeling  of  proces¬ 
sors  and  instructions.  A  superscalar  processor  can  be 
defined  as  a  set  of  reusable  resources  such  as  registers. 
Then,  an  instruction  can  be  represented  by  a  process 
which  acquires  and  consumes  a  subset  of  resources  in 
time.  Second,  the  concept  of  maximal  resource  uti¬ 
lization  which  is  a  fundamental  idea  for  pipeline  and 
superscalar  processors,  can  be  described.  This  is  done 
using  the  notion  of  ACSR  priority  which  allows  the 
specification  of  scheduling  among  several  possible  al¬ 
ternatives.  Third,  ACSR  has  a  proof  technique  that 
can be used to verify properties of instruction
specifications written in ACSR. There is also an available
tool,  called  VERSA  [2],  which  allows  the  programmer 
to  interactively  execute,  analyze,  and  rewrite  ACSR 
specifications. 

Our  formal  processor  specification  based  on  ACSR 
is  useful  in  several  areas.  For  instance,  the  specifica¬ 
tion  provides  an  assembly  programmer  precise  mean¬ 
ing  of  instructions  with  respect  to  timing  behavior  and 
resource  use.  This  information  is  essential  when  a  pro¬ 
grammer  wants  to  use  the  processor  for  time  critical 
systems. Furthermore, the specification can serve as
documentation  between  instruction  set  designer  and 
hardware  implementor.  Since  the  ACSR  interpreter 
already  exists,  the  development  of  a  timing  analyzer 
for  a  new  superscalar  processor  requires  only  a  simple 
translator  from  instructions  to  ACSR  processes.  Fi¬ 
nally,  a  translated  superscalar  program  helps  one  to 
analyze  a  sequence  of  instructions  in  terms  of  data  de¬ 
pendencies.  Such  data  flow  information  is  useful  in 
code  generation  and  optimization  for  compilers. 

Our  work  was  inspired  by  the  pioneering  work  by 
Harcourt  et  al.  which  uses  a  process  algebra  called 
SCCS for instruction specification [4]. Our approach
differs from theirs as follows. Since SCCS does not
have  the  notion  of  resources,  they  represent  each  re¬ 
source  using  a  binary  semaphore  process.  The  result¬ 
ing  specification  becomes  quite  complicated  and  cum¬ 
bersome.  Our  specification  is  clear  and  natural,  be¬ 
cause  of  the  explicit  notion  of  resources  in  the  algebra. 
Furthermore,  since  SCCS  does  not  have  priority  con¬ 
cept,  they  adapted  a  priority  operator  developed  for 
CCS  [1]  in  the  context  of  SCCS.  On  the  other  hand, 
ACSR  has  the  built-in  notion  of  priority,  which  can 
be used to model the maximum parallel and pipeline
execution  of  instructions. 

To  illustrate  our  approach,  this  paper  uses  a  hypo¬ 
thetical  superscalar  processor,  called  ToyP,  developed 
by  Harcourt  et  al.  [4].  The  ToyP  processor  includes 
many  features  of  commercial  processors,  such  as  de¬ 
layed  loads  and  branches,  interlocked  floating-point  in¬ 
structions,  and  multiple  instruction  issue.  To  simplify 
our  presentation,  we  model  only  integer  instructions 
such as add, move, load, store instructions. We
assume that add and move instructions each perform their task
in  a  single  instruction  cycle,  whereas  memory-related 
instructions,  load  and  store,  need  two  instruction  cy¬ 
cles. 

The rest of the paper is organized as follows.
Section 2 introduces the hypothetical ToyP superscalar
processor, presents a subset of ACSR, and reviews some
of its basic properties. Section 3 describes our
approach for specifying instructions using ACSR.
Section 4 demonstrates how a ToyP program can
be translated into ACSR and its execution simulated.
Section 5 summarizes the paper and describes plans
for future work.


2  ACSR  for  ToyP  Processor 

This section introduces the ToyP superscalar
processor and our basic formalism, ACSR (Algebra of
Communicating Shared Resources).

ToyP: a Simple 32-bit Processor. ToyP was
designed by Harcourt et al. to illustrate the specification
of instruction-level parallelism [4]. ToyP is a simple
hypothetical RISC with 32-bit instructions, memory
word size, registers and addresses. To simplify our
presentation, we only consider a subset of ToyP
instructions: add, mov, load, store.


add    Ri, Rj, Rk      Ri <- Rj + Rk
mov    Ri, Rj          Ri <- Rj
load   Ri, Rj, #c      Ri <- Mem[Rj + c]
store  Ri, Rj, #c      Mem[Rj + c] <- Ri

The instructions add and mov each complete in a
single instruction cycle, while the memory-related
instructions, load and store, need two instruction cycles.
Thus, when a load or store instruction is executed,
the result is available after 2 instruction cycles.

Algebra of Communicating Shared Resources.
ACSR is a real-time process algebra that incorporates
the notions of communication, concurrency, resources,
and priorities into a single formalism. One of the
important concepts in ACSR is that of shared resources. We
briefly describe the subset of ACSR that we use to
model microprocessor instructions; a detailed
description and the semantics of ACSR can be found in [6].

We consider a system to be composed of a finite set
$R$ of registers, each with priority 1 in the ToyP
processor. An action is a subset of $R$ that consumes one
cycle of time. For example, the singleton action
$\{(r, 1)\}$ denotes the use of some register $r \in R$. For
simplicity, we omit the priority from actions
hereafter. The action $\emptyset$ represents idling for one
time unit, since no resource is being used. We let A, B, C
range over actions.

The syntax of ACSR processes containing actions
is as follows:

$$P ::= NIL \mid A : P \mid P_1 + P_2 \mid [P]_I \mid P_1 \| P_2 \mid rec\ X.P \mid X$$

NIL is a process that executes no action (i.e., it is
deadlocked). There is one prefix operator: $A : P$
executes a resource-consuming action A, consumes one
time unit, and proceeds to the process P. The Choice
operator $P_1 + P_2$ represents nondeterminism: either
of the processes may be chosen to execute, subject to
the resource limitations of the environment. The
operator $P_1 \| P_2$ is the concurrent execution of $P_1$ and
$P_2$. The Close operator, $[P]_I$, produces a process P
that monopolizes the resources in $I \subseteq R$. The
process $rec\ X.P$ denotes standard recursion, allowing the
specification of infinite behavior.

We write Done for $rec\ X.\emptyset : X$. The process Done
has the following identity property with respect to the
parallel composition operator $\|$: for every process P,
$P \| Done = P$. We introduce one binary combinator,
next, which is used to model the issuing of instructions
in consecutive cycles.

Definition 2.1 For any A, P, Q, the next operator
is defined as follows:

$$(A : P)\ next\ Q \stackrel{\text{def}}{=} A : (P \| Q)$$
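$$
\begin{aligned}
\text{add } R_i, R_j, R_k &\stackrel{\text{def}}{=} \sum_{1 \le l \le 6} \sum_{1 \le m \le 6} \{R_i, R_{jl}, R_{km}\} : Done \\
\text{mov } R_i, R_j &\stackrel{\text{def}}{=} \sum_{1 \le l \le 6} \{R_i, R_{jl}\} : Done \\
\text{load } R_i, R_j, \#c &\stackrel{\text{def}}{=} \sum_{1 \le l \le 6} \{R_i, R_{jl}\} : \{R_i\} : Done \\
\text{store } R_i, R_j, \#c &\stackrel{\text{def}}{=} \sum_{1 \le l \le 6} \sum_{1 \le m \le 6} \{R_{il}, R_{jm}\} : \{R_{il}\} : Done
\end{aligned}
$$

Table 1: Modeling ToyP Instructions Using ACSR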


3  Modeling  ToyP  Instructions  Using 
ACSR 

We model a ToyP instruction as an ACSR process
and the hardware components needed for the execution of
instructions as resources in ACSR. For this paper, we
need to consider only integer registers as resources.
An instruction is modeled by an ACSR process which
specifies the behavior of the instruction in terms of
resource constraints and temporal properties, that is,
which resources are needed and when.

There are two kinds of operations on an integer
register: read and write. A register can be shared by
the  executions  of  several  instructions  if  all  of  them  are 
reading  a  value  from  the  register.  However,  only  one 
instruction  can  use  a  register  if  an  instruction  is  trying 
to  write  a  value  into  the  register.  If  an  instruction 
needs  a  register  whose  use  is  not  compatible  with  a 
currently  executing  instruction,  then  the  execution  of 
the  new  instruction  must  be  delayed. 

Since ACSR resources are serially reusable, we
represent each register as consisting of multiple ports for
read and write. The number of ports depends on how
many instructions can be issued and executed in
parallel. We assume that ToyP can issue and execute up
to three integer instructions in a single cycle if there
are no data hazards. Since each integer instruction can
read a single register twice (e.g., in add R2, R1, R1),
each register can be accessed by 6 read requests in the
same cycle. Instead of modeling registers themselves,
we model these 6 read/write (serially reusable) ports
for each register. Here, a process can read a value if
it acquires any one of the six ports, whereas a process can
write a value only if it acquires all six ports
of the register. Hence, when a process holds a register
for writing, other processes cannot share the register.
$R_{ij}$ denotes port j of register i. For the sake of
simplicity, we sometimes use $R_i$ to mean all 6 ports of
register i, that is,
$R_i = \{R_{i1}, R_{i2}, R_{i3}, R_{i4}, R_{i5}, R_{i6}\}$.
Hence, $\{R_{j1}, R_i\}$ stands for
$\{R_{j1}, R_{i1}, R_{i2}, R_{i3}, R_{i4}, R_{i5}, R_{i6}\}$.
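
The port convention can be made concrete with a small sketch. The Python below is illustrative only and is not part of the ACSR formalism: the names `PORTS`, `all_ports`, `read_port`, and `conflict` are hypothetical, and an action is represented simply as a set of (register, port) pairs. A read claims one port, a write claims all six, and two actions conflict when they claim a common port.

```python
PORTS = 6  # ports per register, as assumed in the text


def all_ports(reg):
    """Resources claimed by a write to `reg`: every one of its ports."""
    return frozenset((reg, p) for p in range(1, PORTS + 1))


def read_port(reg, port):
    """Resource claimed by a read of `reg` through one chosen port."""
    return frozenset({(reg, port)})


def conflict(action_a, action_b):
    """Two actions conflict when they claim a common port."""
    return bool(action_a & action_b)


# Two reads through different ports can share the register.
two_reads = conflict(read_port("R1", 1), read_port("R1", 2))
# A write claims every port, so it excludes any concurrent read or write.
read_write = conflict(read_port("R1", 1), all_ports("R1"))
two_writes = conflict(all_ports("R1"), all_ports("R1"))

print(two_reads, read_write, two_writes)  # False True True
```

Representing a write as the full port set is what turns "one writer or many readers" into ordinary set disjointness, which is the property the later hazard analysis relies on.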

We model the ToyP instructions using ACSR as in
Table 1.

To simplify the presentation, we choose one
alternative from the several choices for the translation
of each instruction. For example, we translate add
R1, R2, R3 into $\{R1, R2_1, R3_1\} : Done$ instead of
$\sum_{1 \le l \le 6} \sum_{1 \le m \le 6} \{R1, R2_l, R3_m\} : Done$.

Using the above translation scheme for ToyP
instructions, a ToyP program can be represented by a
set of ACSR processes. The resulting set of ACSR
processes is called a program specification. The next
example illustrates how to translate a ToyP program
into a program specification.

Example  3.1  The  following  program  is  assumed  to 
be  loaded  into  the  memory  starting  at  the  location 
PC: 

PC    :  add  R1, R1, R1
PC+4  :  load R2, R3, #8
PC+8  :  add  R1, R3, R3


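The above program sequence can be represented by
the following ACSR processes:

$$
\begin{aligned}
Mem(PC) &\stackrel{\text{def}}{=} \{R1\} : Done \\
Mem(PC+4) &\stackrel{\text{def}}{=} \{R2, R3_1\} : \{R2\} : Done \\
Mem(PC+8) &\stackrel{\text{def}}{=} \{R1, R3_2, R3_3\} : Done \\
Mem(PC+12) &\stackrel{\text{def}}{=} NIL
\end{aligned}
$$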
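
The translation in Example 3.1 can be sketched mechanically. The Python below is a hedged illustration, not the authors' tool: `translate`, `write`, and the port-selection rule (each read of a register takes that register's next unused port, wrapping after six) are assumptions introduced here. That rule happens to reproduce the port choices shown in Example 3.1. Only add, mov, and load are handled; store is left out of this sketch.

```python
from collections import defaultdict

PORTS = 6  # ports per register, as in the text


def write(reg):
    """All six ports of `reg`; the text abbreviates this set as the bare register name."""
    return {(reg, p) for p in range(1, PORTS + 1)}


def translate(program):
    """Map (opcode, registers) pairs to lists of per-cycle resource sets."""
    reads_so_far = defaultdict(int)  # hypothetical rule: hand out ports in turn

    def read(reg):
        port = reads_so_far[reg] % PORTS + 1
        reads_so_far[reg] += 1
        return {(reg, port)}

    processes = []
    for op, regs in program:
        if op == "add":
            ri, rj, rk = regs
            actions = [write(ri) | read(rj) | read(rk)]
        elif op == "mov":
            ri, rj = regs
            actions = [write(ri) | read(rj)]
        elif op == "load":
            ri, rj = regs
            # Two cycles: the destination stays held during the second cycle.
            actions = [write(ri) | read(rj), write(ri)]
        else:
            raise ValueError(f"instruction not covered by this sketch: {op}")
        processes.append(actions)
    return processes


EXAMPLE_3_1 = [
    ("add", ("R1", "R1", "R1")),
    ("load", ("R2", "R3")),      # the immediate #8 uses no register
    ("add", ("R1", "R3", "R3")),
]
processes = translate(EXAMPLE_3_1)
for address, actions in zip(("PC", "PC+4", "PC+8"), processes):
    print(address, [sorted(a) for a in actions])
```

Each process is a list with one resource set per instruction cycle, so its length is the instruction's latency in cycles.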

4  Modeling Superscalar ToyP Execution


This section describes how to use ACSR to model the
superscalar aspect of ToyP, namely that it can issue
multiple instructions per cycle. The ToyP processor can
issue and execute multiple (up to 3) instructions at the
same time. In a superscalar architecture, the hardware
detects register-use conflicts (called data hazards) among
instructions. A ToyP program contains no explicit
synchronization information to prevent data hazards.
Hence, in order to predict the timing behavior of a
ToyP program, it is necessary to be able to determine
when data hazards occur.
$$
\begin{aligned}
Insts(PC) &\stackrel{\text{def}}{=} \emptyset : Insts(PC) + \text{Super-Insts}(PC) \\
\text{Super-Insts}(PC) &\stackrel{\text{def}}{=} (Mem(PC) \| Mem(PC+4) \| Mem(PC+8))\ next\ Insts(PC+12) \\
&\quad + (Mem(PC) \| Mem(PC+4))\ next\ Insts(PC+8) \\
&\quad + Mem(PC)\ next\ Insts(PC+4) \\
Program &\stackrel{\text{def}}{=} [\text{Super-Insts}(PC)]_R
\end{aligned}
$$

Figure 1: Modeling the Execution of ToyP Processor


Data Hazard. A value generated by the execution of
an instruction is used by other instructions, and such
a flow of data during program execution creates data
dependencies. A sequence of instructions with a data
hazard cannot be executed in a different order or
concurrently; the instructions must be executed in the
order given in the instruction sequence. Example 3.1
contains a data hazard on register R1, since it is
write-accessed by the first instruction and read-accessed by
the third instruction. A data hazard, whether
read-after-write, write-after-read, or write-after-write,
appears as a resource conflict in ACSR, and the
corresponding process is therefore NIL.

The semantics of ACSR prevents concurrent execution
of instructions with a potential data hazard, as can
be seen from the following lemma.

Lemma 4.1 Given two ACSR processes, $A : P$ and
$B : Q$, that represent two integer instructions,
$(A : P) \| (B : Q) = NIL$ iff $A : P$ and $B : Q$ have a data
hazard.

For example, the sequence of instructions add R2,
R1, R1; add R1, R3, R3 has a write-after-read hazard.
The parallel composition of the two corresponding
ACSR processes is as follows:

$$\{R2, R1_1, R1_2\} : Done \;\|\; \{R1, R3_1, R3_2\} : Done$$

Since R1 is an abbreviation for the collection
$\{R1_1, R1_2, R1_3, R1_4, R1_5, R1_6\}$, the resources $R1_1$ and
$R1_2$ appear in both the left and right processes of the
parallel composition operator. Thus, the above parallel
process has a resource conflict, that is, the process is
equivalent to NIL by the expansion law.

Execution of a ToyP Program. Given a ToyP
program, the ToyP processor executes as many
instructions as possible at each instruction cycle. To
model such execution behavior, we define an indexed
set of ACSR processes, called Super-Insts(PC), where
PC is an index variable denoting the memory
location of the current instruction. As shown in Figure 1,
Super-Insts(PC) specifies the possibilities of executing
three instructions, two instructions, or one instruction.
The choice among these three is explained below.
After one time unit, Super-Insts(PC) determines
whether or not the next instruction can be
executed, as specified by next Insts(PC+12). If there
exists a data conflict between a not-yet-completed
instruction and the next instruction, the execution of
the next instruction is delayed. This delay possibility
is specified by the left choice in the definition of
Insts(PC). If there is no conflict, the next instruction is
chosen for execution.

The  process  Program  defines  the  behavior  of  the 
corresponding  ToyP  program.  Note  that  in  the  defi¬ 
nition  of  Program,  the  process  Super-Insts(PC)  is  de¬ 
fined  with  the  Close  operator  with  the  resource  set 
R. 

The process Super-Insts(PC) has three choices.
These choices represent the scheduling of ToyP's
execution: issuing and executing up to three
instructions simultaneously. When the sequence of
instructions contains a data hazard, the corresponding ACSR
term becomes NIL by Lemma 4.1. Thus, terms with data
hazards are eliminated during the expansion of
Super-Insts(PC).

The next question is how to ensure that as many
instructions as possible are executed at each cycle. This
is the reason why the process Program is defined as
closed Super-Insts(PC). Even after the execution choices
made impossible by data hazards are eliminated, the
process Super-Insts(PC) can still have multiple choices.
In such a case, the notion of preemption in ACSR [6]
allows the selection of the choice with the largest number
of instructions.

One of the most useful laws for process algebras
is the expansion law, which can be used to eliminate
parallel operators. Given a ToyP program
specification, we can use the expansion law and other ACSR
laws to convert it to an ACSR process which does not
contain any parallel operators. The resulting process
describes all possible behaviors of the original ToyP
program and also facilitates the analysis of temporal
properties.


Example 4.1 This example illustrates how to
simulate the ToyP program described in Example 3.1. By
the expansion law, we have
$Mem(PC) \| Mem(PC+4) \| Mem(PC+8) = NIL$,
since there exists a resource conflict between the
first and third instructions (as well as the second and third
instructions). Thus, the Super-Insts(PC) process has
the following expression after some rewriting:

$$
\begin{aligned}
Program &= [\text{Super-Insts}(PC)]_R \\
&= [\, NIL + (\{R1, R2, R3_1\} : \{R2\} : Done)\ next\ Insts(PC+8) \\
&\qquad + (\{R1\} : Done)\ next\ Insts(PC+4) \,]_R
\end{aligned}
$$

Since we have the priority relation
$[R1]_R \prec [R1, R2, R3_1]_R$, by the law of prioritized
choice, we have

$$
\begin{aligned}
Program &= (\{R1, R2, R3_1\} : \{R2\} : Done)\ next\ Insts(PC+8) \\
&= \{R1, R2, R3_1\} : (\{R2\} : Done \;\|\; \{R1, R3_2, R3_3\} : Done) \\
&= \{R1, R2, R3_1\} : \{R2, R1, R3_2, R3_3\} : Done
\end{aligned}
$$

Therefore, the ToyP processor issues and executes the
first two instructions simultaneously. It leaves the
third instruction for the next cycle because of the data
hazard.
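
The outcome of Example 4.1 can be reproduced with a small greedy simulator. The Python below is a sketch under stated assumptions rather than an implementation of ACSR semantics: `simulate` and `compatible` are hypothetical names, the prioritized choice is approximated by trying three, then two, then one instruction, and a group is only issued if no port is claimed twice in any future cycle (a slightly stronger check than the first-action conflict of Lemma 4.1).

```python
PORTS = 6      # ports per register
MAX_ISSUE = 3  # ToyP issues at most three instructions per cycle


def reg(name):
    """All ports of a register (needed for a write)."""
    return frozenset((name, p) for p in range(1, PORTS + 1))


def port(name, p):
    """One port of a register (enough for a read)."""
    return frozenset({(name, p)})


# Example 3.1, already translated: one resource set per cycle.
MEM = [
    [reg("R1")],                                  # add  R1, R1, R1
    [reg("R2") | port("R3", 1), reg("R2")],       # load R2, R3, #8
    [reg("R1") | port("R3", 2) | port("R3", 3)],  # add  R1, R3, R3
]


def compatible(running, candidates):
    """True if no port is claimed twice in any cycle from now on."""
    timelines = running + candidates
    for cycle in range(max(len(t) for t in timelines)):
        used = set()
        for t in timelines:
            if cycle < len(t):
                if used & t[cycle]:
                    return False
                used |= t[cycle]
    return True


def simulate(mem):
    """Return one (instructions issued, resources in use) pair per cycle."""
    pc, running, trace = 0, [], []
    while pc < len(mem) or running:
        issued = 0
        # Prefer the largest group that is free of resource conflicts.
        for n in range(min(MAX_ISSUE, len(mem) - pc), 0, -1):
            if compatible(running, mem[pc:pc + n]):
                issued = n
                break
        running = running + mem[pc:pc + issued]
        pc += issued
        trace.append((issued, frozenset().union(*(t[0] for t in running))))
        # Advance every in-flight instruction by one cycle.
        running = [t[1:] for t in running if len(t) > 1]
    return trace


trace = simulate(MEM)
print([n for n, _ in trace])  # [2, 1]
```

The two entries of `trace` correspond to the two actions in the final expression for Program: the first cycle holds R1, R2 and one port of R3, and the second holds R2, R1 and two further ports of R3.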


5  Conclusion 

In this paper we have presented a technique for
specifying the temporal properties and resource
constraints of instruction-level parallelism using ACSR.
We illustrated our approach using a simple superscalar
processor, called ToyP. Our approach is to consider
the ToyP processor as a set of resources, such as
integer registers. The resource constraints of an
instruction specify a sequence of sets of resources required
by the instruction, one set for each instruction cycle.
Each ToyP instruction is translated to a corresponding
ACSR process. A ToyP program is translated to an
indexed set of ACSR processes. We obtain the timing
properties of the original ToyP program by simplifying
the corresponding ACSR processes using ACSR laws.

We are currently experimenting with ToyP
instruction specifications using the ACSR tool-kit, VERSA [2].
We have also modeled various floating-point
instructions and out-of-order instruction sequence execution.
We are currently investigating more complex
superscalar architectural features such as caches and
pipelines. We are also working on the formal specification
of various branch instructions and conditional
instructions, as well as on automatic derivation of
instruction scheduling parameters.
Abstract

Scheduling of tasks onto multi-processors is an
increasingly important problem in the simulation of
avionics systems. The problem is difficult due to the many
hard real-time constraints imposed on the schedule in the
form of processor frame-time limits and latency
requirements. In this paper, we present a solution to this
real-time scheduling problem using simulated annealing
techniques. The running time of the algorithm is fast
enough for it to be applied in the rapid reconfiguration of
simulation test benches in use at Boeing Flight Systems
Laboratory. Its efficacy is demonstrated using an example
with 60 tasks communicating through 1800 common
blocks and scheduled onto 6 processors under 4 latency
constraints, which achieved a utilization factor over 95%.
Such an example can be scheduled in approximately 35
minutes of CPU time on an HP-Apollo 425 workstation.

1:  Introduction

Simulation is a rapidly growing part of the process of
building new aircraft. The complexity of modern aircraft
systems, especially their avionics, requires lengthy and
comprehensive testing before the first flight occurs. In
testing aircraft systems, individually and together,
simulation is a very effective alternative to traditional
on-ground and flight testing, in terms of both cost and safety.
Unfortunately, real-time simulation of avionics systems in
a multi-processor environment is a complex and difficult
task. One of the many challenges is the allocation of tasks
to processors in a manner that efficiently utilizes the
computing capacity and also meets the timing
requirements of the real-time environment. The particular
environment we are concerned with statically assigns tasks
to processors in a fixed execution order.

An avionics system can be thought of as a collection of
interconnected processing units, called Line Replaceable
Units (LRUs). An LRU provides a particular function,
such as an auto-pilot, and can be swapped into and out of
an aircraft for servicing or maintenance. Simulations are
often organized in terms of LRU tasks, allowing easy
insertion and deletion of tasks. As hardware is
constructed, it is tested in its operating environment by
insertion into the simulation, replacing the corresponding
software LRU. There may even be multiple software
models of an LRU. An engineer checking out auto-pilot
control laws may require a high-fidelity auto-pilot model
with its necessarily higher complexity, whereas an engineer
checking engine performance needs only a simple
auto-pilot model to allow the airplane to fly an acceptable
flight path. In any case, hardware or software LRUs must
have identical interfaces to ease the job of reconfiguring the
simulation.

FORTRAN is used to implement the software LRU
tasks. Common blocks are used to simulate the LRU
interfaces and pass data from one task to another. To
facilitate the restructuring and reordering of software,
common blocks are carefully partitioned by LRU output.
This is needed for modularity; if outputs of two tasks share
a common block and the tasks are scheduled on different
processors, then one task's data may overwrite the data
produced by the other when the common block is copied
to shared memory. Therefore, common blocks have a
single writer and multiple readers. Common blocks reside
in the local memory of the processor running the LRU
task. If data in a common block is needed by a task on
another processor, the common block is written to shared
memory and read by the task needing the data. The cost to
produce or consume a common block is determined by the
size of the block, plus some additional overhead. A
real-time executive on each processor coordinates the tasks by
consuming data, executing the tasks, and producing data.

In  addition  to  running  the  real-time  executive,  the  host 
computer  runs  LRU  tasks,  the  airplane  simulation,  and 
records data in real-time. This is a huge load on resources,
and  improvements  in  computing  capacity  are  always 
quickly  used  up  by  expanding  simulation  requirements. 
The  use  of  multi-processor  computers  has  provided  a 
substantial  improvement  in  computing  capacity,  but  at  the 
same  time  has  greatly  increased  the  complexity  of 
effectively  managing  the  computing  environment. 

Scheduling of tasks onto processors is complicated by
the presence of hard real-time constraints in the form of
processor frame-times and latency requirements. A
processor frame-time constraint states the period with
which a processor will cycle through its assigned tasks.
This will limit the number of tasks that can be assigned to
the same processor. Latency requirements apply to data
paths through a series of alternating tasks and common
blocks and state that a computation (represented by the
path) must be completed within a specified amount of
time. Latency calculations must take into account not only
the execution time of the tasks but also the costs of
copying common blocks to and from memory when
necessary and the relative execution order of the tasks
involved.

2:  Simulated Annealing

The problem has a large solution space ($P^T$ possible
solutions, where P is the number of processors and T is the
number of tasks) [8]. An optimal solution is one for which
no frame-time or latency constraints are violated. Existing
scheduling approaches have some deficiencies when applied
to this situation [1, 2, 6, 7]. Specifically, there are three
points to consider. First, a pre-runtime or static scheduling
approach is needed to ensure that all constraints will be
satisfied all of the time. Second, processor utilization near
100% must be achievable. Third, latency constraints must
be supported (note that these stretch across a sequence of
tasks and cannot be expressed as simple task deadline
constraints). These three characteristics lead us to
investigate heuristic approaches to solving this scheduling
problem.
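
To give a sense of scale, the solution-space count stated above can be evaluated for the example reported in this paper (6 processors, 60 tasks). This is only a back-of-the-envelope evaluation of the stated formula:

$$P^T = 6^{60} = 10^{\,60 \log_{10} 6} \approx 10^{46.69} \approx 4.9 \times 10^{46}$$

A space of this size rules out exhaustive enumeration, which is consistent with the turn to heuristic search.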

One possible heuristic approach would attempt to
mimic the decisions of the software engineer faced with
the same problem. The typical approach to allocating tasks
in a multi-processor environment starts with a uni-processor
simulation. The engineer partitions the simulation into
functional units, relying on knowledge about the data flow
of the LRU tasks and the frame-times necessary to ensure
the individual tasks will support the time-dependent
computations, such as physical control laws. When LRU
tasks pass data to each other sequentially, the best solution
is to order them sequentially in a processor to minimize
data latency. Problems arise when data flow is not clearly
sequential, or when a single processor doesn't have enough
processing power to execute such a set of sequential tasks.
The engineer is faced with making a best guess and then
analyzing the allocation to see if it meets the requirements.
Processor and LRU task frame-times are emphasized over
data latency because the frame-time is often a less flexible
requirement for LRU tasks, and also because it is much
easier for the engineer to measure and analyze than data
latency. All the engineer needs to estimate execution time
are the individual execution times of the tasks. To
estimate data latency, processor frame-times and data flow
information are needed. When an allocation is made to
multiple processors, the data flow must be traced from
task to task and processor to processor in order to sum the
total latency as outlined in the previous section. Minor
changes in allocation can have large effects on these
calculations, thus making the problem difficult to solve
manually.

Algorithmic approaches have the same problems. It is
fairly easy to meet the frame-time requirements, but
difficult to meet the data latency requirements. To solve
the frame-time requirements, a bin-packing algorithm can
be used. A near-optimal solution for frame-times is easily
obtained. However, with a fairly complex data flow, an
algorithm to minimize data flow latency is not readily
apparent. This is further complicated when processors
have differing frame-times and are asynchronous. It is
difficult to capture the decisions made by the engineer to
minimize data latency, and doing so would certainly make a
challenging AI project. Fortunately, there are simpler
alternatives.
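
As an illustration of the bin-packing step mentioned above, the sketch below uses the first-fit decreasing heuristic. The text does not say which bin-packing algorithm is meant, so this choice, the function name `first_fit_decreasing`, and the sample execution times are all assumptions. The sketch also ignores common-block copy costs and latency, which is exactly why bin packing alone does not solve the full problem.

```python
def first_fit_decreasing(exec_times, frame_time, num_processors):
    """Pack tasks onto processors so no processor exceeds frame_time.

    exec_times maps task name -> execution time. Returns one task list per
    processor, or None if some task fits on no processor.
    """
    allocation = [[] for _ in range(num_processors)]
    load = [0.0] * num_processors
    # Place the longest tasks first; each goes on the first processor with room.
    for task, cost in sorted(exec_times.items(), key=lambda kv: -kv[1]):
        for p in range(num_processors):
            if load[p] + cost <= frame_time:
                allocation[p].append(task)
                load[p] += cost
                break
        else:
            return None  # the heuristic found no processor for this task
    return allocation


# Hypothetical execution times, chosen only to exercise the function.
times = {"t0": 12, "t1": 9, "t2": 8, "t3": 7, "t4": 4}
result = first_fit_decreasing(times, frame_time=20, num_processors=2)
print(result)  # [['t0', 't2'], ['t1', 't3', 't4']]
```

Both processors end up with a load of exactly 20, so the frame-time constraint is met with full utilization in this toy case.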

Simulated annealing is an algorithm that has been
successfully applied to another difficult allocation
problem, the optimization of VLSI component placement
[3, 4, 5]. The algorithm is meant to mimic the annealing
process of forming a crystal. A solution is heated to a
point where molecules are moving randomly and not
bonding together. Then the solution is slowly cooled, and
the molecules begin sticking. If the location is a strong
bond, a molecule is unlikely to move again. If the location
is a weak bond, a molecule will probably come loose to try
to find a better location. As the solution cools, a crystal
structure is formed by molecules sticking to the locations
with strong bonds.

The algorithm is very simple and relatively easy to
implement. The basic idea is to generate random moves
(in this case, moving a task to a different processor or
another position in the execution order on the same
processor) and then evaluate the allocation. The
evaluation is based on the frame-time and data latency of
the processors. If the new allocation is better, the move is
always accepted. If the move is not better, the move is
accepted or rejected based on a probabilistic function of two
terms: the current temperature, and the difference between
the new allocation's evaluation and the old.
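
The move and acceptance steps can be sketched as follows. The text does not give the exact probabilistic function, so the Metropolis form exp(-delta / T) used here is an assumption (it is the usual choice in simulated annealing), and `random_move` and `accept` are hypothetical names. An allocation is a list of ordered task lists, one per processor, and lower evaluations are better.

```python
import math
import random


def random_move(allocation, rng):
    """Return a copy of the allocation with one task moved.

    The task may land on another processor or at another position on the
    same processor (occasionally back where it started).
    """
    new = [list(tasks) for tasks in allocation]
    non_empty = [p for p, tasks in enumerate(new) if tasks]
    src = rng.choice(non_empty)
    task = new[src].pop(rng.randrange(len(new[src])))
    dst = rng.randrange(len(new))
    new[dst].insert(rng.randint(0, len(new[dst])), task)
    return new


def accept(old_eval, new_eval, temperature, rng):
    """Always accept improvements; accept worse moves with a probability
    that shrinks as the temperature drops or the degradation grows.
    Assumes temperature > 0."""
    if new_eval < old_eval:
        return True
    delta = new_eval - old_eval
    return rng.random() < math.exp(-delta / temperature)


rng = random.Random(0)
allocation = [["t0", "t1"], ["t2"], []]
moved = random_move(allocation, rng)
print(moved)
print(accept(5.0, 3.0, 0.01, rng))    # True: the move is an improvement
print(accept(0.0, 100.0, 0.01, rng))  # False: exp(-10000) underflows to 0.0
```

At a high temperature the exponential is close to one and nearly every bad move passes; at a low temperature it is close to zero, which matches the qualitative behavior described in the text.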

The  temperature  is  determined  by  a  cooling  schedule. 
The  cooler  the  temperature,  the  less  likely  it  is  that  a  bad 
move  will  be  accepted.  The  cooling  schedule  begins  with 
a  high  temperature,  where  almost  all  bad  moves  are 
accepted and cools to a temperature where almost no bad
moves  are  accepted. 

The key to this algorithm is in the probabilistic
acceptance of bad moves. A problem's solution space
typically has many hills and valleys when evaluating all
the possible configurations. If only good moves are
accepted, that is, those that improve the evaluation, then
movement within the solution space will always be
downhill. Starting from a random point in the solution
space, the best possible solution would be the lowest point
reachable without going uphill. The problem is that the best
solution might be just over the next hill. Allowing bad
moves gives the algorithm the opportunity to climb the hill
and find the better solution on the other side. Algorithms
with this property have been termed probabilistic
hill-climbing algorithms.

3:  Results 

The simulated annealing algorithm has been
implemented on an HP-Apollo workstation using the C
programming language. In this section, we describe the
results obtained on a realistic example. The annealing
program outputs representations of the allocations at every
step in the cooling schedule, including the beginning and
end. An example allocation is given in Table I.

Each row lists, in order, the tasks assigned to a
processor, followed by two numbers separated by a colon.
The number on the left is the specified frame-time of the
processor, which is equivalent to the smallest frame-time
of the tasks allocated to the processor. The number on the
right is the actual execution time of the tasks allocated to
the processor, along with the time needed to read and write
common blocks to shared memory. If the actual execution
time of a processor is larger than the processor's
frame-time, a frame-time overflow occurs. The amount of the
overflow is the execution time minus the frame-time.

The evaluation of the allocation is given by the three
numbers below the allocation table. Frame-time and
latency overruns are computed similarly by first finding all
violated constraints. The frame-time overrun is then the
sum of all the individual processor frame-time overruns
for those processors with frame-time violations. The
latency overrun is generated by summing the amount by
which latency constraints are violated for those constraints
that are, in fact, unsatisfied. The “eval” value is computed
as follows: (latency overrun * weight_factor) + frame-time
overrun. The weight factor is an input parameter.
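
The evaluation can be checked against the numbers reported with Table I. The sketch below is only an illustration of the stated formula: `frame_time_overrun` and `evaluate` are hypothetical names, the (frame-time, execution time) pairs are the six rows of Table I, the latency overrun of 620 is taken as given, and the weight factor of 0.1 is the value printed with the sample output.

```python
def frame_time_overrun(rows):
    """Sum of (execution time - frame-time) over processors that overflow."""
    return sum(max(0.0, exec_time - frame_time) for frame_time, exec_time in rows)


def evaluate(latency_overrun, frame_overrun, weight_factor):
    """eval = (latency overrun * weight_factor) + frame-time overrun."""
    return latency_overrun * weight_factor + frame_overrun


# (frame-time, execution time) for the six processors of Table I.
TABLE_I = [
    (20, 15.124),
    (20, 15.1875),
    (20, 24.1515),
    (20, 22.171),
    (20, 42.2305),
    (20, 16.2),
]

overrun = frame_time_overrun(TABLE_I)
score = evaluate(latency_overrun=620, frame_overrun=overrun, weight_factor=0.1)
print(round(overrun, 3), round(score, 3))  # 28.553 90.553
```

Three processors overflow (by 4.1515, 2.171, and 22.2305), which gives the frame-time overrun of 28.553 and, with the scaled latency term of 62, the eval of 90.553 shown under Table I.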

The best possible value for the evaluation is zero. As
long as all frame-time and latency constraints are met, all
solutions are equal as far as the evaluation is concerned.
The weight_factor weights the latency constraints versus
the frame-time constraints. The latency overruns typically
produce much larger numbers than the frame-time
overruns due to the way the overruns are computed, so the
latency overruns are scaled down by setting weight_factor
to values between zero and one. Also, in the real-time
environment of aircraft simulation, frame-time constraints
are more critical than latency constraints, and the
multiplier helps push overruns to the latency evaluation.

The annealing program was run on an example having
60 tasks, 1800 common blocks, and 4 latency constraints,
targeting a system with 6 processors. The annealing
algorithm was executed 92000 times over 5 temperatures,
taking 35 minutes on an HP 425. Sample output is given
in Table II.


Processor  Tasks allocated to the processor (in order)               Frame-time: exec. time
proc0      t37 t53 t50 t39 t28 t51 t5                                20:15.124 *
proc1      t29 t6 t42 t57 t19 t27 t22 t26 t56 t1                     20:15.1875 *
proc2      t18 t41 t33 t36 t34 t20 t12 t30 t4                        20:24.1515 *
proc3      t31 t32 t44 t13 t11 t55 t38 t14 t0                        20:22.171 *
proc4      t10 t40 t54 t52 t8 t48 t23 t17 t45 t24 t49 t58 t9 t3      20:42.2305 *
proc5      t43 t46 t25 t21 t35 t7 t15 t16 t47 t59 t2                 20:16.2 *

latency overrun = 620    frame-time overrun = 28.553    eval = 90.553

Table I. Sample Task Allocation


Processor  Tasks allocated to the processor (in order)               Frame-time: exec. time
proc0      t10 t55 t11 t30 t4 t12 t31 t3 t59 t5                      20:19.98
proc1      t0 t51 t1 t34 t16 t35 t49 t13 t36 t33                     20:19.07
proc2      t22 t27 t25 t6 t7 t38 t20 t19 t26 t21 t23                 20:19.575
proc3      t8 t48 t24 t18 t14 t58 t15 t46 t28                        40:39.565
proc4      t40 t17 t54 t44 t56 t9 t57 t45 t2 t32                     20:19.7
proc5      t52 t53 t29 t41 t43 t37 t39 t50 t42 t47                   20:19.205

latency overrun = 0    frame-time overrun = 0    eval = 0

(seed=1, weight factor = 0.1)

Table II. Sample Output from Annealing Program
The cooling schedule for this example was arrived at
by experimentation. The schedule has five temperatures:
100, 10, 1.0, 0.1, and 0.01. Using an exponentially
decreasing schedule is typical of annealing
implementations. Initially, a much wider range of
temperatures was used, from 1000 to 0.001. By studying
the evaluations at the extremes of the temperature
schedule, the range was narrowed with no effect on the
quality of the final solution. The upper range was lowered
to 100 because at temperatures of 10 and above, task
allocations were seemingly random. The 100 was left in
the schedule to ensure that the algorithm covered a large
portion of the solution space. In this example, the
algorithm converged on a solution at temperature 0.1.
This is not always the case. In about 20% of the trials,
there was a small frame-time overrun at temperature 0.1
which was erased by the iterations at temperature 0.01.
Since an evaluation of zero was produced at a
temperature of 0.01, there is no need to use a lower
temperature. In no case did an evaluation ever increase
once it went to zero at the 0.01 temperature. The number
of iterations at each step in the cooling schedule was
similarly derived by experimentation. The cooling
schedule used in the above example appears to be
adequate on all our experiments to date. However, given
the small running time of the algorithm, it is conceivable
that the number can be tuned for each problem if many
new simulation configurations (using different LRUs) are
to be implemented.
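
To make the role of the cooling schedule concrete, the following is a minimal sketch of an annealing loop driven by the five temperatures quoted above. It is not the program described in this paper: the evaluation is a toy that counts only frame-time overrun, the move (reassigning one randomly chosen task) and the Metropolis acceptance rule are standard textbook choices, and the names `anneal`, `frame_overrun`, and `iters_per_temp` are hypothetical.

```python
import math
import random

TEMPERATURES = [100.0, 10.0, 1.0, 0.1, 0.01]  # schedule quoted in the text

def frame_overrun(alloc, exec_times, n_procs, frame_time):
    """Toy evaluation: total execution time in excess of the frame time."""
    load = [0.0] * n_procs
    for task, proc in enumerate(alloc):
        load[proc] += exec_times[task]
    return sum(max(0.0, busy - frame_time) for busy in load)

def anneal(exec_times, n_procs, frame_time, iters_per_temp=2000, seed=1):
    """Return (best allocation found, its evaluation)."""
    rng = random.Random(seed)
    alloc = [rng.randrange(n_procs) for _ in exec_times]
    cost = frame_overrun(alloc, exec_times, n_procs, frame_time)
    best_alloc, best_cost = alloc[:], cost
    for temp in TEMPERATURES:
        for _ in range(iters_per_temp):
            # Propose a move: reassign one task to a random processor.
            task = rng.randrange(len(alloc))
            previous = alloc[task]
            alloc[task] = rng.randrange(n_procs)
            new_cost = frame_overrun(alloc, exec_times, n_procs, frame_time)
            delta = new_cost - cost
            # Always accept improvements; accept a worse move with
            # probability exp(-delta / temp), which shrinks as temp falls.
            if delta <= 0 or rng.random() < math.exp(-delta / temp):
                cost = new_cost
                if cost < best_cost:
                    best_alloc, best_cost = alloc[:], cost
            else:
                alloc[task] = previous  # reject the move
    return best_alloc, best_cost

allocation, overrun = anneal([1.0] * 12, n_procs=3, frame_time=5.0)
```

At the highest temperatures nearly every move is accepted, which corresponds to the seemingly random allocations noted above; at 0.01 the loop behaves almost like a greedy descent.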

The process of tuning the schedule and weight factor is
problem dependent, but a schedule and weight factor that
work for a given problem will probably work well for
similar problems. The most important factors are the size
of the problem in terms of tasks and common blocks, the
number and complexity of constraints, and the number of
processors available. Also, the example used for Table II
is tightly constrained in terms of frame-time. No
processor has more than 5% idle time. It seems obvious
that if a given schedule finds a solution with a tightly
constrained problem, it will also work on problems less
tightly constrained. Conversely, the more constraints on a
problem, the more iterations will be required to solve it, if
it can be solved at all.

Since the algorithm is dependent on the number of
iterations, the performance of the algorithm is critical.
After some coding improvements, the runtime of a smaller
example was reduced from 10 minutes to a little over 3
minutes. The runtime of the larger example above was
dominated by the computations related to the large number
of common blocks and scaled linearly with the number of
blocks and constraints in the input. This was verified by
adding large numbers of artificial latency constraints. A
profiling utility was used to identify the most heavily used
routines, and modifications to four routines (all in the
annealing evaluation routine) resulted in the performance
improvement. The improvements in efficiency were
almost entirely derived from substituting C language
pointer addressing and operations for indexed arrays.

4:  Conclusion 

The problem of allocating tasks in a real-time
multiprocessing environment to simulate modern aircraft is
intractable, but has a suitable heuristic solution using the
simulated annealing algorithm. The problem has been
formalized with a set of equations defining processor
frame-time and data latency as functions of task allocation.
The equations are directly used to define the evaluation
portion of the simulated annealing algorithm. The
algorithm was run on a large example based on existing
real-time aircraft simulations and shown to be
approximately linear in the number of common blocks and
latency constraints. Several input parameters have to be
fine-tuned experimentally to yield good results, but the
execution speed of the algorithm (approximately 35
minutes as opposed to several days or weeks for hand
methods) is fast enough to support repeated trials and be
effectively applied in the Boeing Flight Systems
Laboratory.
Abstract

The  navigation  payload  software  for  the  next  block  of 
Global  Positioning  System  satellites  recently  completed 
testing.  The  computer  program  for  the  onboard  computer 
is  sufficiently  complex  to  expose  almost  every  issue  that 
has  been  put  forward  in  rate  monotonic  theory.  The 
success  of  this  effort  demonstrates  the  utility  of  the  theory 
in this type of application. The system design required
the processor to perform a highly diverse set of hard
deadline real-time functions. This design would have been
difficult or impossible prior to the development of rate
monotonic theory. The use of utilization bounds has
important advantages from a software engineering point of
view. The problems of ensuring schedulability over the
course of development and verifying the schedulability of
the finished system are discussed.

Background

Rate  monotonic  scheduling  theory  has  been  successfully 
applied to the development and testing of complex real-time
software for an embedded space vehicle application.
The  project  development  occurred  over  the  same  period  of 
time  that  much  of  the  material  on  rate  monotonic 
scheduling  was  being  published.  Thus,  it  was  an 
opportunity  to  apply  these  ideas  very  soon  after 
publication  to  a  large  project  with  cost  and  schedule 
commitments.  Because  the  system  design  required  the 
processor  to  perform  such  a  complex  mix  of  functions,  it 
depended  heavily  on  the  application  of  this  theory. 

The  software  was  developed  for  the  Navigation 
Payload  in  the  NAVSTAR  Global  Positioning  System 
(GPS)  Replenishment  Satellites  program.  These  satellites 
will  be  launched  during  the  late  1990's  as  the  current 
generation  of  GPS  satellites  is  retired.  The  new  satellites 
have  significantly  greater  functionality  than  their 
predecessors. 

Work on this project began in the spring of 1987, and
final qualification testing was completed in the fall of 1993.
Hardware production and related work will extend into
the 1990's. The initial phase of development was a
competitive contract that led to the award of the current
development contract. In the initial phase, a brassboard
prototype was developed which supported a core subset
of functions for the Navigation Payload. The prototype
provided a proof of concept and some execution time
benchmarks without which the risk of the ensuing
development would have been unacceptable. Full scale
development was completed in the next phase. Although
some time, perhaps a year, can be attributed to the
transition between two contracts, the extended
development time reflects the scope and complexity of
the effort.

The Global Positioning System

The  Global  Positioning  System  provides  navigation 
and  time  signals  to  suitably  equipped  users  on  a  global 
basis  for  position,  velocity  and  time  determination.  It  also 
provides  a  nuclear  event  detection  capability. 

GPS  is  composed  of  a  User  Segment,  a  Control 
Segment  and  a  Space  Segment.  The  User  Segment 
consists  of  user  navigation  receivers  that  receive  and 
process  the  satellite  downlink.  The  Control  Segment 
consists  of  ground  stations  responsible  for  monitoring  the 
space  vehicles,  supplying  the  space  vehicles  with  updated 
Navigation  Data,  and  ensuring  proper  space  vehicle 
operation.  The  Space  Segment  is  a  constellation  of 
twenty-four space vehicles in six orbital planes, providing
twenty-four  hour  coverage  worldwide.  On  the  space 
vehicle,  the  Navigation  Payload  performs  the  Space 
Segment  role  in  the  navigation  mission  and  provides 
communications  for  the  nuclear  detection  mission. 

Because  these  satellites  are  used  for  aircraft  and  other 
navigation,  the  integrity  and  reliability  of  the  system  are 
critical.  If  the  satellite  were  to  broadcast  erroneous  signals 
or  data,  the  consequences  could  be  disastrous.  In  this  type 
of  hard  deadline,  real-time  system,  failure  of  a  task  to 
complete  in  time  could  have  unpredictable  results.  The 
hardware often enters an undefined or undesirable state if
the software fails to meet a deadline. It was therefore
critical to ensure the reliability of the real-time design.

The  Navigation  Payload  Mission  Processor 

Within the Navigation Payload, the Mission Data
Unit (MDU) serves as a central interface point to
numerous other space vehicle subsystems. These include
the Telemetry, Tracking and Control System, the Nuclear
Detection System, the Spacecraft Processing Unit, an
inter-satellite crosslink for communications and ranging,
and an L-band downlink. The MDU also contains
specialized hardware for timing control, modulation
control, navigation, and communications security. The
MDU includes the flight software which is the subject of
this paper. This is an Ada program, supported by the EDS
Scicon XD Ada cross-compiler and linker. The embedded
MDU processor, called the Mission Processor, is a
Marconi MAS281, which is an implementation of the
MIL-STD-1750A architecture with memory mapping.

The computer program in the Navigation Payload is
large and complex compared to other spacecraft software.
It performs a wide variety of loosely coupled functions.
There are numerous communication functions which
involve data buffering, bit-packing, forward error
correction coding, error detection and cryptography. In
addition, the software implements a phase-locked loop that
maintains the highly accurate timing signal to the
navigation users. The requirement to function
autonomously for 180 days dictates that the spacecraft will
perform many computationally intensive functions which
are currently performed by the Control Segment. These
include monitoring the integrity of the navigation and
timing information, estimating the satellite's orbital
parameters from inter-satellite range measurements, and
maintaining the synchronization of the GPS constellation
using the inter-satellite crosslink.

The software deadlines result from the many real-time
interfaces that must be serviced continuously. The I/O
architecture  includes  twenty-seven  interfaces  which  use 
nineteen  interrupts  with  rates  up  to  one  kilohertz.  All  six 
spare  1750A  interrupts  are  used,  three  of  which  are 
multiplexed.  Thirty-two  I/O  ports  are  allocated  to  the 
register  I/O  address  space,  and  twenty  I/O  ports  to  a 
memory  mapped  I/O  page  in  one  of  the  address  states. 

The processor memory architecture requires the program
to be partitioned into address states which have separate
logical address spaces. The Navigation Payload software
maps five 1750A logical address states to 448K 16-bit
words of physical memory. Additional memory is
allocated to store data received from the Control
Segment. The XD Ada runtime system requires a separate
Ada main program in each address state. Global shared
data provides inter-state communication and the XD Ada
implementation of the classical semaphore provides
inter-state synchronization.

Perhaps the most distinguishing feature of the
software design is that there are fifty-one tasks. Some
Ada designs may have a large number of tasks which
serialize access to data. Although this is the case with
some tasks, most of these tasks manage asynchronous
activities with deadlines. Five tasks are XD Ada direct
interrupt handlers. Two additional interrupt handlers are
the only software coded in assembly language. Other
tasks are allocated as application processing threads or as
servers. Tasks have deadlines ranging from a few
milliseconds to several minutes. Because the Mission
Processor is required to resume operation after a
processor reset, some checkpointed data are not altered by
Ada elaboration. Initialization following Ada elaboration
and task activation is distributed over the tasks. The XD
Ada runtime system provides task scheduling in the Ada
preemptive model without time slicing. Interrupt handlers
may propagate events through rendezvous or shared data.
Lower rate tasks schedule themselves by delay statements
or by rendezvous with a higher priority task which hands
off a large computational job with a later deadline.
Servers wait for a rendezvous with a client. From the
point of view of data flow and functional processing, the
software contains several pipelines with delays
corresponding to rate monotonic scheduling deadlines.

Since  this  large  number  of  tasks  incurs  penalties  in 
memory  usage,  runtime  overhead,  and  the 
comprehensibility  of  the  system,  the  design  was  often 
reviewed  to  see  if  the  number  of  tasks  could  be  reduced. 
Most  of  the  shared  data  structure  tasks  have  been 
optimized  away.  In  the  final  design,  only  a  small 
reduction  would  be  possible  at  the  expense  of 
significantly poorer modularity. Ensuring schedulability of
such a large number of tasks was a significant problem.

Although much detail has changed since the
prototype, the basic software architecture and use of Ada
tasking have not. One source of change was the evolution
of many functional and interface requirements; another
was a change of compiler vendors. Development began
with a compiler that used the Ada rendezvous as the
means of inter-address state communication and
synchronization. In early 1991, the program transitioned
to the XD Ada compiler. The difference in interrupt
models induced changes in interrupt handlers. Interrupts
are now handled as described in reference [3]. The
alternate methods for dealing with 1750A memory
mapping caused substantial task re-allocation, including
proliferation of agent tasks to transport service requests
across address states between clients and servers.

Software development substantially followed the
methodologies described in reference [10]. In addition,
unbounded priority inversion was prevented by design
practices similar to those described in [11], [4], [5] and
[6].

Insuring  Schedulability  throughout 
Development 

Reference  [4]  gives  two  groups  of  techniques  for 
guaranteeing  schedulability.  The  first  group  of  techniques, 
derived  from  references  [7]  and  [11],  is  based  on 
computing  utilization  bounds.  The  second  group  of 
techniques  verifies  schedulability  by  determining  the 
response time of each task. In these techniques, there is a
tradeoff  between  complexity  and  pessimism.  The  simplest 
techniques  tend  to  yield  overly  pessimistic  results.  As  the 
techniques  become  more  complex,  they  yield  increasingly 
realistic  results.  All  of  the  techniques  for  determining 
response  times  are  more  difficult  and  require  more  data 
than  the  techniques  for  computing  utilization  bounds. 

Application of the most accurate techniques to a
system of this complexity would be far too costly and
would have delayed the schedule. In addition, since
execution time budgets must be continuously monitored,
it is important that the data required to do this monitoring
not be too complicated. The most economical analysis will
do just enough to prove schedulability and no more.

Rate monotonic analysis requires a global view of the
entire processor. However, good design dictates that the
program be decomposed such that each piece can be
designed with limited knowledge of the other pieces. On
a complex system, it is important to be able to deal with
the global properties of the system at a higher level of
abstraction. Utilization bounds are in the spirit of
information hiding in that only a small amount of
information about each task need be exposed. Information
hiding has been shown to be an important element of
software productivity [1]. This "separation of concerns" is
also discussed in [11]. As a result of these considerations,
only utilization bound techniques were employed.

Execution  time  budgets  were  frequently  exceeded  at 
every  stage  of  the  development  cycle.  This  required 
reallocating  budget  or  performing  some  optimization  or 
both.  When  an  optimization  is  required,  an  abstract,  global 
view  of  schedulability  is  essential  in  deciding  which  of  the 
potential  optimizations  will  best  address  the  problem.  The 
fact that optimization was needed so frequently points out
the importance of evaluating execution time estimates and
measurements as they become available.

Utilization  was  continuously  updated  as  the  design 
matured  and  execution  time  estimates  became  more 
accurate.  In  the  early  phases,  a  single  utilization  bound 
was  used  to  assign  initial  execution  time  budgets.  Since 
processing  times  always  seem  to  increase  as  a  software 
project  progresses,  the  fact  that  a  single  utilization  is 
conservative  is  an  advantage.  As  the  design  matured,  the 
single  bound  technique  yielded  too  pessimistic  a  result 
and  more  complex  techniques  were  employed. 

Use  of  a  single  utilization  bound  requires  that  all 
tasks  are  strictly  periodic.  If  events  have  a  burst  rate 
higher  than  the  average  rate,  this  shortest  period  must  be 
used  in  the  analysis.  If  the  deadline  is  less  than  one  full 
period,  you  must  either  use  a  lower  bound,  inflate  the 
execution  time,  or  assume  a  period  for  the  task  equal  to 
the  worst  case  deadline.  Because  of  these  considerations, 
the  utilization  that  is  computed  for  a  single  bound 
comparison is much higher than what is actually the case.
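
The pessimism of a single bound can be seen in a short sketch. It assumes that the single bound is the classical Liu and Layland bound n(2^(1/n) - 1), and it implements only the last of the three options listed above, treating each task as if its period were the smaller of its burst period and its deadline. The task figures and function names are hypothetical.

```python
def liu_layland_bound(n):
    """Classical rate monotonic utilization bound for n periodic tasks."""
    return n * (2 ** (1.0 / n) - 1)

def single_bound_check(tasks):
    """tasks: list of (compute_time, burst_period, deadline) tuples.

    Each task is charged as if it recurred every min(period, deadline),
    which is conservative and inflates the computed utilization.
    """
    utilization = sum(c / min(period, deadline) for c, period, deadline in tasks)
    return utilization, utilization <= liu_layland_bound(len(tasks))

# Deadlines equal to periods: the set passes the single bound.
relaxed = single_bound_check([(1, 4, 4), (1, 5, 5), (2, 10, 10)])

# Same set, but the third task must finish within half its period:
# the computed utilization rises from 0.65 to 0.85 and the test fails.
tight = single_bound_check([(1, 4, 4), (1, 5, 5), (2, 10, 5)])
```

Failing this test does not show that the task set is unschedulable, only that the single bound cannot prove it, which is the motivation for the multiple bounds discussed next.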

Because  the  Mission  Processor  has  many  cases  of 
both  shorter-than-period  deadlines  and  high  burst  rates 
that  are  significantly  higher  than  the  average  rate,  this 
system  would  not  meet  the  single  bound  criteria.  It  was 
necessary  to  use  multiple  bounds.  Reference  [4]  describes 
use  of  utilization  bounds  for  each  event  when  the 
deadlines are within the period. The utilization of task i,
$f_i$, is computed as follows:

$$f_i = \sum_{j \in H_n} \frac{C_j}{T_j} + \frac{1}{T_i} \left( C_i + B_i + \sum_{k \in H_1} C_k \right)$$

where C is compute time, T is the period, B is blocking,
$H_n$ is the set of tasks with a shorter period than task i,
and $H_1$ is the set of tasks that must execute in order for
task i to complete. This utilization is then compared to
the appropriate bound.
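
As an illustration, the per-task utilization can be computed as follows. The sketch assumes the formula has the form usually given in the rate monotonic literature: the utilization of the shorter-period tasks, plus task i's own compute time, its blocking, and one execution of each task in the second set, all divided by task i's period. The task names and numbers are hypothetical.

```python
# Hypothetical task set: C = compute time, T = period, B = blocking.
TASKS = {
    "fast": {"C": 1.0, "T": 10.0, "B": 0.0},
    "mid": {"C": 2.0, "T": 20.0, "B": 1.0},
    "slow": {"C": 4.0, "T": 100.0, "B": 0.0},
}

def task_utilization(name, tasks, hn, h1):
    """Utilization charged to one task.

    hn: names of tasks with a shorter period (they preempt repeatedly).
    h1: names of tasks that must execute once before `name` completes.
    """
    task = tasks[name]
    preemption = sum(tasks[j]["C"] / tasks[j]["T"] for j in hn)
    once = sum(tasks[k]["C"] for k in h1)
    return preemption + (task["C"] + task["B"] + once) / task["T"]

# 1/10 from "fast", plus (2 + 1 + 4) / 20 for "mid" itself, its
# blocking, and one execution of "slow".
f_mid = task_utilization("mid", TASKS, hn=["fast"], h1=["slow"])
```

Only the compute time, period, and blocking of each task are needed, which is the small amount of exposed information referred to earlier.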

A  major  simplification  in  the  use  of  this  technique  is 
that  it  is  not  necessary  to  perform  this  for  every  event.  At 
the  expense  of  being  slightly  more  conservative,  the 
equation above can be modified to compute utilization for a
set  of  tasks.  Tasks  can  be  divided  into  sets  that  span 
ranges  of  rates  and  utilization  can  be  computed  for  each 
set.  With  a  small  number  of  sets,  most  of  the  simplicity 
of  a  single  bound  is  retained.  Selecting  the  sets  carefully 
eliminates  the  large  discrepancy  between  computed  and 
actual  utilization  which  occurs  with  a  single  bound.  For 
most  of  the  design  process,  just  two  bounds  were  used. 
This was later extended to three. Utilization for a set of
tasks, $H_2$, can be expressed as:


$$f = \sum_{j \in H_n} \frac{C_j}{T_j} + \frac{1}{\min_{k \in H_2}(T_k)} \sum_{k \in H_2} C_k$$


Comparing this utilization to the appropriate bound can
then prove the schedulability of the set of tasks. The
periods, $T_k$, in $H_2$ are the burst rate periods but the
periods, $T_j$, in $H_n$ are the average over $\min(T_k)$. This
eliminates the excessive conservatism of a single bound.

Verifying  Schedulability  of  the  Finished 
System 

To  some  extent,  the  schedulability  of  a  system  is 
demonstrated  by  the  various  functional  tests.  However,  on 
a complex system it is not practical to figure out how well
the test cases prove that the system is schedulable in every
possible condition. Therefore, a final analysis was
performed with measured execution times. On a finished
system, there is the option of measuring the response
times  directly.  This  was  considered  but  there  were 
instrumentation  problems  with  several  events.  Therefore, 
the  same  utilization  bound  technique  that  was  used  during 
development  was  used  in  the  test.  This  required  accurate 
execution  time  measurements. 

The first attempt was to use end-to-end execution
times as though they were the actual times. Test
scenarios were constructed which attempted to minimize
the activity which would preempt the task that was being
measured. The HP Software Performance Analyzer was
used to measure the time between a synchronization event,
such as a delay expiration or interrupt, and the completion
of processing. This device builds a histogram of durations
which occur between two events. From this, the longest
response time in a test scenario can be measured. The test
scenario is designed so that it forces at least one
occurrence of the longest execution path of the task being
measured. However, the nature of the system is such that
preemption could not be sufficiently minimized.

The XD Ada runtime system has a pointer to the task
control block of the currently running task. This pointer is
changed if, and only if, there is a context switch. With
the HP State Analyzer, we could trigger on a desired
event and measure the exact time between every context
switch until the completion of the response to the event.
This had the virtue that all runtime overhead was
included in the measurement. It also provided the
blocking time measurements. The disadvantage is that
substantial knowledge of the program flow was required
to unravel which task executions were part of the
response. This turned out to be less difficult than it
appeared and it had the unexpected benefit that it
uncovered some design flaws.
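
The bookkeeping behind this measurement can be sketched as follows. The trace format, task names, and timestamps are hypothetical: each entry records the time at which the current-task pointer changed and the task that started running, as a state analyzer capture might be post-processed. This is an illustration of the idea, not the tooling used on the project.

```python
def time_per_task(switches, end_time):
    """Attribute each interval between context switches to the running task.

    switches: time-ordered (timestamp_us, task_name) pairs, one per
    context switch, starting at the trigger event.
    end_time: timestamp at which the response completed.
    """
    totals = {}
    boundaries = [stamp for stamp, _ in switches[1:]] + [end_time]
    for (start, task), stop in zip(switches, boundaries):
        totals[task] = totals.get(task, 0) + (stop - start)
    return totals

# Hypothetical capture: the measured task is preempted once by an
# interrupt handler, then finishes before another task takes over.
TRACE = [(0, "nav_update"), (120, "crosslink_isr"),
         (150, "nav_update"), (400, "telemetry")]

totals = time_per_task(TRACE, end_time=500)
execution_time = totals["nav_update"]         # time spent in the response itself
end_to_end = 400 - 0                          # what a two-event histogram would see
preemption = end_to_end - execution_time      # time lost to other tasks
```

Deciding which entries of `totals` belong to the response is the step that requires knowledge of the program flow.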

Conclusion 


This  project  was  a  successful  application  of  rate 
monotonic  theory  to  the  design,  analysis  and  testing  of  a 
complex  real-time  system.  It  would  have  been  very 
difficult,  perhaps  impossible,  to  implement  this  system 
without  preemptive  scheduling.  Throughout  the  project, 
concerns  about  the  use  of  Ada  and  preemptive  scheduling 
in  hard  deadline  systems  were  published  [9],  [8],  [2]. 
There  were  also  reservations  within  our  own  organization. 
None  of  the  difficulties  were  significant  compared  to  the 
advantages  of  the  approach. 

What  is  most  important,  the  methodology  used  makes 
it  possible  to  deal  with  the  overall  schedulability  of  a 
complex  real-time  system  at  a  higher  level  of  abstraction 
while  maintaining  the  loose  coupling  and  decentralized 
design  of  the  pieces. 

Several  major  issues  were  only  touched  on  in  this 
paper.  The  design  optimizations  which  were  required 
provide important insights. The hardware/software
tradeoffs that led to this design suggest some trends in
real-time systems. Some important techniques in the use
of  Ada  were  devised.  Economically  obtaining  execution 
time  estimates  throughout  the  design  was  a  challenge. 
These  may  be  the  subject  of  future  papers. 
Abstract

Real-time systems manipulate data types with inherent timing constraints. Priority-based scheduling is a
popular  approach  to  build  hard  real-time  systems,  when  the  timing  requirements,  supported  run-time 
configurations,  and  task  sets  are  known  a  priori.  Future  real-time  systems  will  need  to  support  these  hard 
real-time  constraints  but  in  addition  (a)  provide  friendly  user  and  programming  interfaces  with  audio  and 
video  data  types  (b)  be  able  to  communicate  with  global  networks  and  systems  on  demand,  and  (c) 
support  critical  command  and  control  services  despite  potential  risks  introduced  by  such  added  flexibility 
and  dynamics.  In  this  paper,  we  argue  that  temporal  protection  mechanisms  can  be  as  beneficial  in  these 
systems  as  virtual  memory  protection.  The  processor  reservation  mechanism  that  we  have  implemented 
in  Real-Time  Mach,  for  example,  provides  guaranteed  timing  behavior  for  critical  activities. 


1.  Introduction 

In  real-time  systems,  the  correctness  of  a  computation  depends  upon  both  its  logical  and  temporal 
correctness.  As  a  result,  earlier  real-time  systems  were  often  hand-crafted  in  order  to  meet  stringent 
timing  constraints.  Recently,  more  flexible  priority-based  scheduling  approaches  have  become  popular 
[3, 6, 13]. However, the design of such systems still requires a priori knowledge of tasks and their
timing  requirements,  and  therefore  these  systems  tend  to  be  very  static  in  nature.  Many  recent  trends 
indicate  that  future  systems  increasingly  need  to  be  much  more  dynamic  in  nature: 

• Real-time applications are becoming more pervasive and complex. For example, the next
generation of naval systems is expected to support many data types including analog,
discrete, graphics, audio, video and voice, with integrated communications and control with
low latency requirements. This trend has been accelerated by two related mainstream factors.
First, the explosive surge of multimedia applications has literally brought time-critical data
types (audio and video) to the desktop. Second, the continuing growth of computing power
at ever falling prices enables more and more applications, with multitasking becoming a
natural candidate to use up available cycles.

•  The  advent  of  high-performance  networks  such  as  ATM  opens  up  new  applications  with  high 
bandwidth,  low  latency,  and  guaranteed  service  requirements.  These  applications  include 
tele-medicine,  distance  learning,  advanced  air  traffic  control,  sophisticated  defense  systems 
and  networked  patient  monitoring.  In  these  applications,  any  overload  conditions  because  of 
dynamic  requests  and/or  connections  must  not  disrupt  their  basic  mission. 

•  There  is  a  rush  towards  universal  connectivity  and  access,  where  a  piece  of  information  (or  a 
person)  is  just  a  call  away.  Such  high-degree  of  connectivity  between  systems  and  networks 
(both  via  cable  and  wireless  networks)  provides  global  access  to  information  databases. 
Future systems would therefore be hard pressed not to exploit such information availability
and reachability. In addition, such information accesses may need to be set up dynamically
on user demand.

The  above  trends  are  expected  to  result  in  flexible,  friendly,  dynamic  and  informative  systems  but  this 
flexibility  and  accessibility  bring  about  new  problems  such  as  issues  of  security  and  privacy.  In  addition, 
real-time  connections  (such  as  multimedia)  must  be  established  online.  These  multimedia  interfaces 
require  real-time  behavior,  but  not  at  the  cost  of  adversely  affecting  those  critical  activities  with  hard 
real-time  constraints.  In  this  paper,  we  argue  that  robust  temporal  protection  mechanisms  are  needed  to 
supplement  and  complement  spatial  protection  mechanisms  in  order  to  achieve  this  goal. 

2.  Protection  Mechanisms  for  Real-Time  Programs 

Early  on,  real-time  systems  were  often  built  without  virtual  memory  protection  because  the  stochastic 
nature,  long  delays  and  potential  overhead  introduced  by  demand  paging  were  considered  incompatible 
with  real-time  requirements.  However,  processors  and  memory  have  become  faster,  and  real-time 
operating  systems  [4,  11,  16,  17]  have  become  more  sophisticated.  Any  associated  overhead  of  virtual 
memory  management  is  now  considered  to  be  worth  the  address  space  protection  enforced  across  process 
boundaries.  With  address  space  protection,  logical  misbehavior  on  the  part  of  one  process  (such  as  the 
use  of  incorrect  address  pointers)  does  not  necessarily  mean  that  an  entire  processor  will  fail.  We  refer  to 
such  protection  mechanisms  as  spatial  protection. 

Traditional  non-real-time  operating  systems  also  provide  a  simple  notion  of  temporal  protection  with 
fairness  as  a  primary  motivation.  The  scheduler  in  time-sharing  systems  typically  uses  a  multi-level 
feedback  queueing  mechanism  such  that  a  process  executing  for  a  long  time  typically  has  its  priority 
lowered in order to let other waiting processes execute [5]. Hence, even a process which enters an infinite
loop  can  normally  be  stopped  or  killed  when  its  scheduled  time  quantum  expires  and  another  process  is 
scheduled. 

In  real-time  systems,  where  timeliness  of  critical  activities  (and  not  fairness)  is  the  primary  motivation, 
the  issue  of  temporal  protection  needs  to  be  substantially  re-considered.  For  example,  consider  the  use  of 
fixed  priority  scheduling  approaches  such  as  rate-monotonic  scheduling  [7,  15]  in  building  real-time 
systems.  Under  a  fixed  priority  scheduler,  a  process  which  enters  an  infinite  loop  will  preempt  all  its 
lower  priority  tasks,  which  can  then  never  run  again.  Often,  the  only  recourse  is  to  reboot  the  machine. 
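
The starvation effect described above can be illustrated with a small simulation. This is only a sketch under stated assumptions: the task names, the tick-based time model, and the function `simulate_fixed_priority` are hypothetical and are not taken from any real-time kernel discussed here.

```python
def simulate_fixed_priority(tasks, ticks):
    """Simulate a preemptive fixed-priority scheduler for a number of ticks.

    tasks: list of (name, priority, demand) tuples. A larger priority number
    means a more important task. demand is the number of CPU ticks the task
    wants; None models a task stuck in an infinite loop.
    Returns a dict mapping each task name to the CPU ticks it received.
    """
    remaining = {name: demand for name, _, demand in tasks}
    received = {name: 0 for name, _, _ in tasks}
    by_priority = sorted(tasks, key=lambda t: -t[1])

    for _ in range(ticks):
        # At every tick, run the highest-priority task that still wants the CPU.
        for name, _, _ in by_priority:
            if remaining[name] is None or remaining[name] > 0:
                received[name] += 1
                if remaining[name] is not None:
                    remaining[name] -= 1
                break
    return received


# Hypothetical task set: the highest-priority task never finishes.
runaway = [("control", 3, None), ("logger", 2, 5), ("display", 1, 5)]
# Same task set with a well-behaved high-priority task.
healthy = [("control", 3, 10), ("logger", 2, 5), ("display", 1, 5)]

starved = simulate_fixed_priority(runaway, 100)
served = simulate_fixed_priority(healthy, 100)
```

In the runaway case the lower-priority tasks receive no processor time at all, which is the behavior that motivates temporal protection.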

Spatial  virtual  memory  protection  has  been  adapted  to  real-time  systems  by  providing  the  ability  to  lock 
down  memory  pages.  Similarly,  it  is  reasonable  to  expect  that  temporal  protection  schemes  need  to  be 
adapted  to  real-time  systems  in  general  and  priority-driven  real-time  systems  in  particular.  We  believe 
that  such  temporal  protection  mechanisms  are  critical  to  support  the  flexible  and  dynamic  application 
environments  described  earlier.  In  the  next  section,  we  consider  temporal  protection  schemes  in  detail. 

2.1.  Temporal  Protection  for  Real-Time  Programs 

Many  real-time  operating  systems  support  a  multitasking  environment  for  its  inherent  modularity,  ease 
of  program  development  and  debugging,  and  programming  (as  well  as  conceptual)  compatibility  with 
traditional  operating  systems  such  as  Unix  [1].  The  timing  behavior  of  a  real-time  process  in  this 
multitasking  environment  depends  upon  its  own  behavior  and  its  level  of  resource-sharing  with  other 
processes.  Resources  shared  by  processes  can  be  either  physical  or  logical.  Physical  resources  shared 
across processes include the CPU, buses, networks, memory pages, memory heap, I/O interfaces, etc.
Logical resources can include servers, shared queues, communication buffers, etc.

Scheduling  theory  for  processors  [7],  buses  [14]  and  networks  provides  the  means  to  determine  whether 
a  set  of  tasks  using  a  physical  resource  can  meet  its  timing  requirements.  Similarly,  synchronization 
protocols  [13,  12,  2]  provide  the  ability  to  analyze  the  needs  of  real-time  tasks  to  share  logical  resources. 
However,  these  analytical  techniques  must  necessarily  make  assumptions  such  as  the  worst-case 
execution  time  of  a  task,  the  maximum  duration  of  a  critical  section,  or  the  maximum  bus  transaction 
time.  If  these  assumptions  are  violated,  undesirable  consequences  can  occur.  In  static  systems,  it  may  be 
relatively  easier  to  ensure  at  development  time  that  these  assumptions  are  indeed  satisfied.  However,  as 
real-time  systems  and  applications  become  more  dynamic  and  flexible  in  nature,  the  robustness  of  the 
system’s  ability  to  deliver  its  critical  functionality  may  be  compromised  by  errors  and  violations  in  a 
relatively  new  and  untested  process. 

3.  Guaranteed  Processor  Reservation  for  Real-Time  Programs 

Processor  and  memory  sharing  are  two  critical  pieces  which  can  substantially  affect  (and  dominate)  the 
timing  behavior  of  a  real-time  program.  We  have  been  investigating  an  operating  system  abstraction 
called  processor  reserve  [8,  9]  to  provide  temporal  protection  to  a  real-time  process  at  the  level  of  CPU 
sharing.  In  this  abstraction,  we  view  processor  capacity  as  a  quantifiable  resource  which  can  be  reserved 
like  physical  memory  or  disk  blocks.  A  processor  reserve  represents  a  claim  on  processor  capacity  over 
time  (e.g.  10  ms  of  computation  time  out  of  each  50  ms  of  wall-clock  time).  An  admission  control  policy 
determines  whether  a  reservation  request  is  accepted  or  not,  and  once  the  processor  reservation  is 
established,  it  is  scheduled  and  enforced  by  the  operating  system.  Together,  reservation  and  the 
enforcement  mechanism  provide  a  scheduling  firewall  which  protects  reserved  programs  from  outside 
interference  in  much  the  same  way  as  memory  protection  isolates  a  program  address  space  from  access  by 
other  programs. 

Our  processor  reservation  scheme  has  been  implemented  in  Real-Time  Mach  [17].  Real-Time  Mach 
supports a priority-driven paradigm to schedule real-time tasks^1. Each processor reserve is assigned a
rate-monotonic priority based upon its requested rate of usage and the processor is still scheduled on the
basis of fixed priorities^2. The reservation scheme includes an admission control policy to prevent
overload  and  a  mechanism  to  accurately  measure  computation  time  consumed  by  programs.  In  addition 
to  measuring  computation  time  usage,  the  reservation  mechanism  enforces  computation  time  limits  over 
the  short-term  in  order  to  ensure  that  a  program  which  attempts  to  use  more  computation  time  than  its 
allocation  does  not  interfere  with  the  timing  behavior  of  other  programs. 
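
A minimal sketch of how a reserve and an admission test might be represented is given below. The text does not state which admission test Real-Time Mach applies, so the classic rate-monotonic utilization bound n(2^(1/n) - 1) is assumed here purely for illustration; the names `Reserve`, `rm_bound`, `admit`, and `rm_priority_order` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Reserve:
    """A claim of compute_ms of CPU time out of every period_ms of wall-clock time."""
    compute_ms: float
    period_ms: float

    @property
    def utilization(self) -> float:
        return self.compute_ms / self.period_ms


def rm_bound(n: int) -> float:
    """Sufficient utilization bound for n tasks under rate-monotonic scheduling."""
    return n * (2 ** (1 / n) - 1)


def admit(existing: list, request: Reserve) -> bool:
    """Accept the request only if total utilization stays within the assumed bound."""
    candidate = existing + [request]
    total = sum(r.utilization for r in candidate)
    return total <= rm_bound(len(candidate))


def rm_priority_order(reserves: list) -> list:
    """Rate-monotonic assignment: the shorter the period, the higher the priority."""
    return sorted(reserves, key=lambda r: r.period_ms)


first = Reserve(compute_ms=10, period_ms=50)   # the 10 ms out of 50 ms example
ok_alone = admit([], first)
ok_small = admit([first], Reserve(30, 100))    # total utilization 0.5
ok_large = admit([first], Reserve(80, 100))    # total utilization 1.0
```

The bound is sufficient but not necessary, so a real admission policy could accept task sets that this sketch rejects.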

3.0.1.  Experimentation  with  Processor  Reservation 

We  now  describe  an  application  built  on  our  reservation  scheme.  The  application  consists  of  a  number 
of  instantiations  of  a  QuickTime  video  player  [18],  each  of  which  displays  a  video  stream  on  the  screen. 
Each  program  reads  a  short  video  clip  and  then  begins  to  output  frames  to  the  screen  using  a  memory- 
mapped  frame  buffer.  The  video  resolution  is  160x120  with  8  bits  of  color.  The  program  applies  a  noise 
filter  to  each  frame  before  it  is  displayed.  By  itself,  one  instantiation  of  the  program  can  run  at  23.2 
frames  per  second  on  a  486-based  machine. 


^1 Other CPU scheduling policies such as round-robin are also provided as dynamic configuration options.
^2 It is relatively easy to extend this scheme to use dynamic priorities based on earliest deadline scheduling.
When we run two instantiations of the program under a time-sharing policy, each program averages
11.6 frames per second. Under this policy, the programs get variable service: first one video stream gets
preference  from  the  time-sharing  scheduler  and  then  the  other,  and  they  alternate  getting  better  and  worse 
frame-update  service  as  their  priorities  change  in  the  multi-level  feedback  queue. 

Our  reservation  system  allows  us  to  go  further  in  controlling  the  timing  execution  behavior  of  these  two 
programs.  If  we  consider  one  of  the  programs  as  the  "focus"  in  the  same  way  a  video  teleconference  has 
one  video  stream  as  the  "focus,"  we  can  reserve  more  processor  capacity  for  that  stream  at  the  expense  of 
the  second  stream.  For  example,  when  we  give  one  video  stream  a  reservation  of  80%  of  the  processor 
(e.g. 80 ms every 100 ms) and allow the other to consume the remaining processor capacity, we get 18.6
frames  per  second  on  the  "focus"  video  stream  and  4.2  frames  per  second  on  the  other  stream. 
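
As a rough sanity check on these figures, one can compare them with what simple proportional scaling of the stand-alone frame rate would predict. The linear model and the helper `expected_fps` are assumptions made for illustration; the text reports only the measured rates and does not explain any difference from such a model.

```python
SOLO_FPS = 23.2  # measured frame rate of one player running alone


def expected_fps(cpu_share: float, solo_fps: float = SOLO_FPS) -> float:
    """Frame rate predicted if throughput scaled linearly with CPU share."""
    return cpu_share * solo_fps


time_shared = expected_fps(0.5)   # two players sharing the CPU equally
focus = expected_fps(0.8)         # stream holding the 80% reservation
background = expected_fps(0.2)    # stream left with the remaining capacity

print(f"time-shared: {time_shared:.2f} fps (measured 11.6)")
print(f"focus:       {focus:.2f} fps (measured 18.6)")
print(f"background:  {background:.2f} fps (measured 4.2)")
```

Under this assumption the time-shared and "focus" rates match the measurements closely, while the unreserved stream falls somewhat short of its proportional share.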

3.0.2.  Processor  Reservation  Manager 

Fine-grained  feedback  on  performance  and  the  status  of  the  reservation  can  help  the  application  adapt 
to  its  own  behavior  and  to  the  behavior  of  other  parts  of  the  system.  Also,  reservations  may  be  changed 
by  forces  external  to  the  reserved  program,  and  the  program  must  be  informed  of  the  change  so  that  it  can 
adjust  its  behavior.  A  "reservation  manager"  that  manages  the  reservations  on  a  system  based  on  user 
input  via  a  reservation  user  interface  might  make  such  external  reservation  adjustments. 

3.0.3.  Processor  Reservation  in  Distributed  Systems 

Processor  capacity  reserves  can  support  reservation  in  distributed  real-time  systems  by  having  each 
reserve  contain  reservations  for  various  resources  around  the  distributed  system.  Then  messages 
containing requests for remote service will contain these "sub-reserves" which can be used to "charge" the
remote  service.  Another  aspect  of  reservation  in  distributed  systems  concerns  the  reservation  of 
communications  protocol  processing  on  each  of  the  hosts  [10].  We  are  currently  investigating  other 
applications  of  processor  reservation  in  user-level  schedulers  and  dedicated  bandwidth  for  critical 
activities. 


4.  Conclusion 

Real-time  systems  need  predictable  timing  behavior,  and  predictability  is  often  achieved  by  exploiting 
the  a  priori  knowledge  of  supported  system  functionality.  These  systems  therefore  tend  to  be  static  in 
nature.  However,  due  to  recent  trends  towards  multimedia  applications,  high-performance  networking 
and  wide  connectivity,  it  can  be  expected  that  future  real-time  systems  will  support  a  highly  dynamic  mix 
of  applications  and  connections.  These  flexible  and  dynamic  systems  can  be  susceptible  to  errors  and 
misbehavior  on  the  part  of  some  task(s)  and/or  network/bus  traffic.  It  is  highly  desirable  that  protection 
mechanisms  be  available  in  these  systems  to  ensure  that  critical  functionality  is  still  provided  by 
preventing  temporal  interference  from  other  activities. 

Address  space  protection  offered  by  virtual  memory  provides  a  logical  fence  between  processes.  We 
similarly  argue  that  temporal  protection  mechanisms  are  also  crucial  fences  that  need  to  be  built  between 
the timing behavior of critical real-time activities. One of our abstractions for such temporal protection is
called  the  processor  reserve.  This  abstraction  ensures  that  a  real-time  task  is  guaranteed  a  required 
fraction  of  the  processor  at  a  certain  rate.  This  abstraction  has  been  implemented  in  Real-Time  Mach 
where  tasks  with  guaranteed  reservations  are  themselves  scheduled  using  rate-monotonic  priority 
assignment.  We  are  currently  investigating  other  temporal  protection  mechanisms  in  the  management  of 
memory,  display  and  storage. 
Abstract: The design of general purpose operating
systems imposes constraints on the way one can structure
real-time applications. This paper addresses the problem
of  minimizing  the  end-to-end  latency  of  applications  that 
are  structured  as  a  set  of  cooperating  (real-time)  tasks. 
When  applications  are  structured  as  a  set  of  cooperating 
tasks  the  time  required  for  data  to  progress  from  an  input 
task  to  an  output  task  is  a  function  of  the  number  of  the 
tasks  that  handle  the  data  and  the  deadlines  of  individual 
tasks. We present an integrated inter-process communication
and scheduling scheme that can be used to
minimize the end-to-end latency of multi-threaded
applications. Our approach is to provide the scheduler
with information on the inter-process communication
interconnections between tasks and to use this information
to guarantee an end-to-end latency to applications that is
simply a function of the timing properties of the
application and not its task structure. This scheme has
been  implemented  within  the  YARTOS  kernel  and  is 
presently  being  ported  to  the  Real-Time  Mach  kernel. 

1.  Introduction

Multimedia  applications  that  process  streams  of  live  and 
stored  audio  and  video  are  stimulating  research  on  the 
integration  of  real-time  computation  and  communication 
services  into  general  purpose,  time-shared  operating 
systems.  While  much  is  known  about  the  scheduling  and 
resource allocation problems that comprise the formal
underpinnings of such services, techniques for implementing
and  using  existing  algorithms,  in  particular  within  the 
context  of  general  purpose  operating  systems,  have 
received  relatively  little  attention. 


* Supported in part by grants from the IBM Corporation, the Intel
Corporation, and the National Science Foundation (numbers
CCR-9110938 and ICI-9015443).


In  this  note,  we  describe  a  problem  that  arose  during  the 
implementation  of  an  experimental  desktop  video- 
conferencing  system  [4,  5].  Abstractly,  the  problem  is 
that  of  minimizing  end-to-end  latency  in  real-time 
applications  that  consist  of  a  set  of  cooperating  tasks  or 
threads. Here latency is defined as the difference between
the time at which input data is first made available to an
application thread and the time at which an application
thread  performs  an  output  operation  based  on  the  input 
data.  The  thesis  of  this  work  is  that  by  providing  the 
kernel  with  information  on  the  task  structure  of  real-time 
applications,  one  can  both  dramatically  reduce  the  worst 
case  end-to-end  application  latency  and  employ  relatively 
simple  scheduling  algorithms  to  provide  real-time 
response  to  individual  tasks. 

The  following  section  motivates  the  end-to-end  latency 
problem  using  an  idealized  version  of  our  video- 
conferencing  system  as  an  example.  Section  3  outlines  a 
real-time  message  passing  service  that  we  constructed 
within  the  YARTOS  (Yet  Another  Real-Time  Operating 
System)  kernel  [7].  We  show  how  this  service  reduces 
worst  case  end-to-end  latency  and  how  it  can  be  efficiently 
implemented.  The  YARTOS  message  passing  service  is 
currently  being  ported  to  the  Real-Time  Mach  kernel  [11] 
and will form the basis for a comparative study of the
real-time performance of the YARTOS and RT-Mach thread
models.

2.  The  End-to-End  Latency  Problem 

Real-time  computations  require  bounded  response  times. 
In  general,  by  employing  results  from  the  real-time 
scheduling  literature  (e.g.,  [10]),  for  relatively  simple 
models of computation, it is possible to (1) determine
conditions  under  which  it  is  theoretically  possible  to 
guarantee  that  an  invocation  of  a  task  will  complete 
execution  by  a  certain  point  in  time,  and  (2)  allocate 
resources  within  an  operating  system  to  ensure  that  an 
invocation  of  a  task  actually  achieves  its  response  time 
bound. 

Often,  it  is  desirable  to  guarantee  a  response  time  to  a 
collection  of  cooperating  tasks  that  execute  in  concert  to 
realize  some  application.  For  example,  consider  the  video 
processing  portion  of  a  desktop  videoconferencing 
application.  The  goal  of  this  application  is  to  acquire, 
compress,  and  transmit  a  logically  infinite  sequence  of 
digitized  video  frames  across  a  network.  The  application 
is composed of the following (idealized) tasks:

•  FG  —  a  task  to  control  a  frame-grabber  that  digitizes 
video  frames  generated  by  a  camera, 

•  UP  —  a  task  to  invoke  user  programs  on  the  digitized 
frames  for  any  user-level  image  processing  that  is 
desired  (e.g.,  for  feature  extraction  and  notification), 

•  CP  —  a  task  to  compress  the  digitized  frame,

•  FF  —  a  task  to  format  and  fragment  the  compressed
frame(s)  into  network  packets  for  delivery  across  a
network,  and 

•  NI  —  a  task  to  control  the  network  interface  hardware. 

These  tasks  cooperate  to  form  a  simple  pipeline.  Every 
video  frame  generated  by  the  camera  is  digitized,  processed 
by  the  user,  compressed,  formatted,  and  delivered  to  the 
network  interface  for  transmission  across  the  network. 
In order for this conferencing application to be effective,
two  real-time  constraints  must  be  met.  First,  every  video 
frame  that  is  generated  by  the  camera  must  make  it 
through  all  stages  of  the  pipeline  and  be  delivered  to  the 
network  interface.  Second,  the  end-to-end  latency  of  each 
frame  —  defined  as  the  difference  between  the  time  the 
frame  arrives  at  the  network  interface  and  the  time  the 
frame  was  generated  —  must  be  kept  to  a  minimum. 
Since  current  video  cameras  and  frame-grabbers  generate 
data  at  regular,  periodic  intervals,  the  first  constraint  is 
easily  satisfied  by  implementing  each  stage  of  the  pipeline 
as  a  periodic  task  and  using  any  number  of  real-time 
scheduling  algorithms  from  the  literature  to  schedule  the 
tasks.  The  second  constraint  is  not  so  easily  satisfied. 

An  (NTSC)  video  frame  is  generated,  and  enters  the 
pipeline,  every  33.3  ms.  In  our  implementation  of  the 
above  video  pipeline,  every  task  has  a  period  of  33.3  ms. 
Since there are 5 stages in the pipeline, the worst case
end-to-end latency of a video frame is 166.6 ms. The worst case
occurs  when  each  invocation  of  each  task  completes  as  late 
as  possible  within  its  period  and  stage  i  +  1  of  the 
pipeline  is  not  invoked  until  stage  i  has  completed  as 
shown  in  Figure  1.  Whether  or  not  the  worst  case 
actually  occurs  will  depend  on  factors  that  are  beyond  the 
application  writer’s  control  such  as  the  magnitude  of  the 
total system workload (e.g., the number of other real-time
and  non-real-time  tasks  sharing  the  processor). 
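
The worst-case figure follows from multiplying the number of stages by the period of a stage. The short sketch below works through that arithmetic; the function name is hypothetical. With the rounded period of 33.3 ms the product is 166.5 ms, and the 166.6 ms quoted above is consistent with using the unrounded period of 100/3 ms.

```python
def worst_case_pipeline_latency(num_stages: int, period_ms: float) -> float:
    """Worst case when every stage finishes at the very end of its own period
    and the next stage is not invoked until the previous one has completed."""
    return num_stages * period_ms


STAGES = ["FG", "UP", "CP", "FF", "NI"]

rounded = worst_case_pipeline_latency(len(STAGES), 33.3)     # rounded period
exact = worst_case_pipeline_latency(len(STAGES), 100 / 3)    # unrounded period
single_task = worst_case_pipeline_latency(1, 33.3)           # one combined task
```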

Note  that  the  latency  bound  of  166.6  ms  is  really  an 
artifact  of  the  pipeline  implementation  of  the  video 
application  and  is  not  fundamental  to  the  conferencing 
problem itself. For example, consider an implementation
of  the  conferencing  application  that  combines  functions 
FG,  UP,  CP,  FF,  and  NI  into  a  single  task.  If  the  task 
had  a  period  of  33.3  ms,  then  all  video  frames  that  are 
generated  will  be  delivered  to  the  network  interface 
(assuming  the  system  is  still  schedulable)  and  the  worst 
case  end-to-end  latency  of  each  frame  will  be  no  more  than 
33.3  ms. 

The  problem  is  that  it  is  not  always  possible  (or  desirable) 
to collapse all application and system functions into a
single task. In particular, when executed on top of a
general purpose operating system, many of the real-time
application’s functions (such as input and output) are
implemented by operating system calls or servers
(and  associated  device  drivers)  and  are  shared  with  other 
applications. 

The  challenge  therefore  is  to  support  the  pipeline  model  of 
application  design  and  execution  while  not  incurring  the 
penalty  inherent  in  the  straightforward  realization  of  the 
pipeline.  Specifically,  we  would  like  to  structure  the 
conferencing  application  as  a  series  of  cooperating  tasks 
and  maintain  a  worst  case  end-to-end  latency  bound  of  33.3 
ms  —  the  period  of  a  single  stage  of  the  pipeline. 

Note  that  in  principle  this  should  be  possible  since  the 
total  amount  of  computation  (ignoring  operating  system 
overhead)  performed  by  the  single  and  multi-task 
implementations of the application is the same. The
only difference is that in the multi-task implementation of
the application, several video frames may be processed
simultaneously (i.e., several video frames may be in the
pipeline  at  any  one  time). 

3.  A  Real-Time  Message  Passing  Service 

Our  solution  to  the  problem  of  minimizing  end-to-end 
latency  is  to  make  the  pipeline  structure  of  the 
conferencing  application  known  to  the  kernel  and  to  use 
this  information  to  schedule  the  stages  of  the  pipeline  as  if 
they  were  part  of  a  larger  sequential  program.  This 
technique  has  been  implemented  as  part  of  the  message 
passing  system  in  the  YARTOS  kernel  [7].  We  begin 
with  an  overview  of  the  YARTOS  programming  model. 


The  YARTOS  kernel  supports  a  simple  data-flow  model 
of  real-time  computation.  Briefly,  applications  are 
composed  of  tasks,  resources,  and  ports.  Tasks  are  threads 
of  control,  resources  are  shared  abstract  data  types,  and 
ports  are  queues  for  messages.  Tasks  communicate  with 
other  tasks  by  sending  messages  to  ports.  Each  port  is 
bound  to  a  unique  task.  When  a  message  is  sent  to  a  port, 
the  kernel  schedules  the  task  bound  to  the  port  so  that  the 
message  will  be  consumed  before  a  deadline  defined  by  the 
rate  at  which  the  message  sender  emits  messages.  The 
deadline  is  chosen  so  as  to  ensure  that  all  messages  from 
this sender can be processed in real-time (i.e., without any
buffering).  (The  YARTOS  programming  model  is 
explained  in  greater  detail  in  [9].  The  scheduling 
algorithm  used  in  the  kernel  is  described  in  [6].) 

When  tasks  and  ports  are  created,  the  kernel  constructs  a 
directed  graph  of  all  possible  communication  paths.  When 
a message is sent from task Ti to task Tj, the deadline for
task Tj is computed using the time of Ti's most recent
invocation as the invocation time for Tj. That is, tasks Ti and
Tj are scheduled as if they were invoked simultaneously,
as if they were a single task.

For example, assume task Ti is invoked at time t and has a
deadline at time t + p. Ti executes sometime during the
interval [t, t+p] and sends a message to task Tj. No matter
when the message is actually sent to Tj, task Tj is
considered to have been invoked at time t. It is a property
of the YARTOS programming model that the invocation
of task Tj “occurring” at time t cannot have a deadline
before time t + p. Therefore, during the interval [t, t+p],
Tj will not preempt Ti and when Tj is dispatched, there
will be a message from task Ti for it to process.

The  one  exception  to  these  invocation  rules  is  when 
messages  arrive  from  the  outside  world  (e.g.,  from 
interrupt  handlers).  When  a  task  receives  a  message  from 
an  external  process,  the  task’s  deadline  is  computed  from 
the  arrival  time  of  the  message  (using  application  specified 
parameters that are sufficient for providing the desired
real-time response to the external process).

With  this  message  passing  scheme,  the  time  required  in 
the worst case for a message to pass through tasks Ti and
Tj in YARTOS is the same as the time required in the
worst case for a message to be processed by a single task
that combined the functions of Ti and Tj. Thus the worst
case  end-to-end  latency  of  a  multi-threaded  YARTOS 
application  is  not  a  function  of  the  task  structure.  Rather, 
it  is  a  function  of  the  deadlines  associated  with  application 
messages. 

For  example,  in  our  videoconferencing  application,  all 
messages  have  a  deadline  for  processing  of  33  ms.  If  the 
FG  task  receives  a  message  (an  interrupt  in  this  case)  at 
time t, then if this message results in messages being sent
to tasks UP, CP, FF, and NI, all messages will be
processed  at  or  before  time  t  +  33.  Therefore,  each  video 
frame  is  delivered  to  the  network  interface  no  more  than  33 
ms  after  it  was  generated. 
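
The effect of inheriting the sender's invocation time can be sketched by comparing two ways of assigning deadlines along the pipeline. This is an illustrative model only, not the YARTOS implementation: the function names are hypothetical, and the "chained" variant simply assumes each stage is invoked when its predecessor reaches its own deadline.

```python
PIPELINE = ["FG", "UP", "CP", "FF", "NI"]


def inherited_deadlines(stages, arrival_ms, deadline_ms):
    """Every downstream task is treated as if it had been invoked at the
    arrival time of the external message, so all stages share one deadline."""
    return {stage: arrival_ms + deadline_ms for stage in stages}


def chained_deadlines(stages, arrival_ms, deadline_ms):
    """Each stage is invoked only when its predecessor completes, taken here
    in the worst case to be the predecessor's own deadline."""
    deadlines = {}
    invoked_at = arrival_ms
    for stage in stages:
        deadlines[stage] = invoked_at + deadline_ms
        invoked_at = deadlines[stage]
    return deadlines


inherited = inherited_deadlines(PIPELINE, arrival_ms=0, deadline_ms=33)
chained = chained_deadlines(PIPELINE, arrival_ms=0, deadline_ms=33)
```

With a 33 ms deadline, the inherited scheme bounds delivery to the network interface at 33 ms after the interrupt, while the chained scheme allows it to slip to 165 ms.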

The alternate approach to minimizing end-to-end latency is
to  combine  all  video  processing  tasks  into  a  single  task. 
However, in our system tasks FF and NI are actually
general purpose operating system services that are shared
with other user applications and hence cannot be
embedded  directly  into  the  conferencing  application.  (In 
fact,  it  is  largely  for  this  reason  that  a  common  approach 
to  achieving  real-time  performance  in  general  purpose 
operating  systems  has  been  to  move  application  code  into 
the  operating  system  where  finer-grain  control  over 
resource  allocation  is  also  usually  possible.) 

4.  Related  Work 

Our  message  passing  system  is  related  to  the  paradigm  of 
communication  and  scheduling  integration  reported  by 
Draves et al. [2]. In this work a scheduling and
context-switching mechanism based on the programming language
concept of continuations is introduced to allow an
application that consists of multiple threads to execute
more like a single threaded application. The emphasis in
[2], however, was on reducing system overhead. Our work
seeks  to  minimize  worst  case  end-to-end  latency. 

Other  related  work  includes  the  general  priority  model  of 
Harbour et al. [3], wherein periodic tasks can be
decomposed  into  subtasks  that  may  have  varying 
execution  priority.  In  such  a  model  it  is  possible  to  more 
directly  express  and  reason  about  what  we  have  called  end- 
to-end  latency  constraints.  In  our  work  we  have  argued 
that  a  simple  scheduling  algorithm  (described  in  [6])  is 
sufficient  for  managing  latency. 


Lastly,  the  flow  shop  scheduling  results  of  Bettati  and  Liu 
[1] are relevant. They consider the problem of minimizing
end-to-end  latency  in  a  system  of  multiple  processing 
elements  (e.g.,  a  distributed  system).  We  have  only 
considered  the  latency  problem  on  a  single  shared 
processor.

5.  Conclusions  and  Future  Work 

The design of general purpose operating systems imposes
constraints on the way one can structure real-time
applications. Common operating system services such as
network transport protocols and device management need
to be used by real-time applications. Because such
services are shared with other applications they cannot be
tightly bound to the real-time applications. We have
shown that by making application inter-task communication
paths known to the kernel, one can provide a worst case
end-to-end application latency bound that is equivalent
to the bound for an implementation of the application as a
single  task. 

While  we  described  the  real-time  message  passing  service 
within  the  context  of  an  application  whose  tasks  form  a 
pipeline,  the  service  can  be  applied  to  any  graph  structure 
to  minimize  latency  of  message  communication  along  any 
path  in  the  graph. 

Currently  we  are  porting  the  YARTOS  message  passing 
service to the RT Mach (MK83) kernel and hope to compare
the  end-to-end  latency  of  applications  using  the  RT  Mach 
and  YARTOS  communication  primitives. 
Abstract

Researchers have used advances in hardware technology
to design larger and more complex real-time applications.
Larger applications require new integration
techniques while more complex applications require a
restructuring of the underlying system support. We
examine the system design issues of supporting SPARTAs
(Soft PArallel Real-Time Applications). There
exists a gap between hard real-time kernel mechanisms
and the functionality desired by a SPARTA programmer.
Thus, an integral part of supporting SPARTA
design will be providing an intermediate runtime layer.
We describe our experiences building Ephor, including
what motivated its conception and development, and
the resulting separation of responsibilities both easing
SPARTA design and improving their performance.


1  Introduction 

Many real-world applications contain both hard
and soft real-time components. There has been considerable
work on hard real-time system design such
as [3] as well as work for parallel [1] and distributed [2]
environments. Target applications for hard real-time
systems include airplane autopilot and nuclear power
plant control. New complex, parallel soft real-time
applications have been generating considerable interest.
Some example applications are: autonomous navigation,
reconnaissance, and surveillance; operator-in-the-loop
simulation; and teams of autonomous cooperating
vehicles. The real-time community has acknowledged
the need to explore issues raised by SPARTAs.
Stankovic [4] enumerates a number of issues that
we are directly addressing in the design of Ephor*.
Among them are “what are the correct interfaces to
robotics, RTAI, Vision, ... etc.” and “what functionality
should be in the OS level and what in the application
level.”

Designing a SPARTA is challenging, since in its full
generality it calls for dynamic decision making about
resource allocation, scheduling, choice of methods,
and handling reflexive or reactive behavior smoothly
within a context of planned or intended actions, and
a host of other issues not typically encountered either
in off-line or hard real-time applications. SPARTAs
need different system support than either the large
data-crunching scientific programs or the smaller
less-structured applications currently being investigated in
parallel environments. Further, supporting such applications
was beyond the intended scope of previous
real-time kernels because other more fundamental or
lower level issues needed to be addressed first.

Hard real-time systems have not been designed to
support the newly evolving soft real-time applications.
In particular, they lack the flexibility needed to adjust
to a complex and dynamic environment. The reason
is that they must provide absolute predictability and
guaranteed scheduling. Our runtime, Ephor, interacts
with SPARTAs, maintaining hard real-time behavior
when needed while providing graceful degradation in
cases where performance is important but not critical
to the success of the application. Our intermediate
runtime layer is built on a hard real-time substrate
providing the additional functionality needed by
SPARTAs. The runtime reduces the replicated work of
system monitoring and dynamic decision-making that
is common between applications.

Initially, the effort needed to develop a general runtime
package, such as Ephor, versus simply incorporating
the needed portions into an application, may
appear prohibitive. However, an analogy to threads
of control indicates this may not be so. Historically,
threads of control were thought of simply as support
for co-routining under direct user control. However,
through time, many other issues with thread management
have arisen. A similar situation applies with
tasks, methods, or even planners in SPARTAs. Synchronization
between tasks may be a significant issue,
as might be the interleaving of tasks, or running one
based on an exception generated by another, etc. Although
it is conceivable these issues could be handled
at the application level much the same way parallel
thread management could be, there are compelling
reasons for studying the systems aspects of such general
capabilities.

* Ephor was the name of the council of five in ancient Greece
that effectively ran Sparta.

The  rest  of  the  paper  is  structured  as  follows.  Sec¬ 
tion  2  describes  the  motivation  for,  and  our  experi¬ 
ences  leading  to,  designing  a  runtime.  Section  3  con¬ 
tains  the  separation  of  responsibilities  between  the  ap¬ 
plication,  runtime,  and  kernel,  facilitating  SPARTA 
design.  We  present  brief  concluding  remarks  in  sec¬ 
tion  4. 

2  Motivating  Factors 

Problem  Domain 

Our  research  work  in  this  area  developed  from  a  desire 
to  study  techniques  for  handling  overdemand  in  real¬ 
time  applications.  Originally,  our  plan  was  to  design 
a  real-time  application  containing  controllable  overde¬ 
mand  situations.  The  shepherding  problem  we  de¬ 
vised  contains  overdemand  as  well  as  many  other  gen¬ 
eral  properties  described  below  in  SPARTA  Proper¬ 
ties.  For  a  complete  description  of  shepherding  see  [5]. 
Briefly,  sheep  (small  vehicles)  move  around  in  a  field 
(table)  and  a  shepherd  (robot  arm)  tries  to  maximize 
the  number  contained  in  the  field.  Before  and  during 
implementation  we  observed  that  there  were  several 
mechanisms  that  would  have  been  useful  but  were  not 
provided  by  current  real-time  systems  such  as  con¬ 
currency  control,  dynamic  technique  selection,  help  in 
priority  assignment,  etc.  As  Stankovic  [4]  notes,  “Be¬ 
cause  of  these  reasons  many  researchers  believe  that 
current  kernel  features  provide  no  direct  support  for 
solving  difficult  time  problems,  and  would  rather  see 
more  sophisticated  kernels...”  A  system  that  provided 
these  features  would  be  very  useful.  Rather  than  in¬ 
cluding  these  features  into  the  kernel  and  sacrificing 
kernel  predictability,  we  placed  the  additional  func¬ 
tionality  into  Ephor,  our  runtime  environment.  This 
allows  the  designer  to  use  the  real-time  kernel  most  ap¬ 
propriate  for  their  environment.  The  application  still 
receives  the  same  functionality  (actually  more)  and  we 
maintain  a  predictable  kernel. 


In  designing  Ephor  in  conjunction  with  the  shep¬ 
herding  application  we  wanted  to  ensure  the  mech¬ 
anisms  developed  for  shepherding  would  be  applicable 
to other programs. To do so, we designed the shepherding
application to have many of the same properties
as the soft real-time applications mentioned in
the  introduction.  These  applications  contain  an  ele¬ 
ment  of  search  whereby  the  agent  determines  the  next 
course  of  action.  Most  are  designed  around  a  high- 
level  executive  instructing  lower  levels.  The  executive 
reasons  using  a  model  of  the  real  world  and  carries 
out  actions  in  it.  The  world  is  governed  by  general 
principles,  but  is  not  predictable.  There  is  often  an 
intermediate  layer  responsible  for  small  corrections  to 
the  requested  action  (servoing).  Also,  there  is  often 
a  low-level  layer  whose  actions  need  to  be  carried  out 
constantly  and  can  occur  “subconsciously”,  i.e.,  with¬ 
out  intervention  from  the  higher  levels. 

SPARTA  Design  Problems 

Under  the  standard  taxonomy  there  is  only  an  appli¬ 
cation  and  kernel.  Any  operations  or  functions  not 
performed  by  the  kernel  are  the  responsibility  of  the 
application.  While  the  application  needs  to  respond  to 
the  environment,  determine  the  next  course  of  action, 
and  evaluate  current  progress,  to  perform  reasonably 
it  also  needs  to: 

1. Monitor the load of the underlying processors.
Often the application has a choice of tasks to
accomplish the same end. Rather than submit
impractical tasks, if the application knew
the amount of compute power available, it could
quickly choose the appropriate task to submit.

2.  Maintain  a  list  of  allocated  resources.  As  in  1, 
a wiser task submission based on task resource
allocation  yields  better  behavior. 

3. Track execution times of tasks, especially highly
variable  ones.  We  have  found  that  locally,  the 
execution  time  of  a  task  is  fairly  predictable,  so 
knowing  the  last  several  execution  times  is  useful. 

4.  Determine  the  correct  interleaving  of  prioritized 
tasks. 

5. Monitor other resources such as memory usage,
bus  utilization,  etc. 

It  was  our  objective  in  designing  Ephor  to  ensure  the 
enumerated  ideas  could  be  handled  by  the  generalized 
runtime,  thus  removing  a  significant  burden  from  the 
SPARTA  programmer. 
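
Item 3 above, together with the cpu_time and cpu_times fields of the Appendix, suggests estimating a task's execution time from the median of its last three measurements. The following is a minimal sketch of that bookkeeping, not Ephor's actual code: the struct timing type, the samples field, and the function names are hypothetical, and using the latest sample until three exist is an assumption.

```c
#include <assert.h>

#define HISTORY 3  /* the Appendix keeps the last three times */

/* Hypothetical per-task timing record, modeled on the cpu_times,
   cpu_index, and cpu_time fields of technique_t in the Appendix. */
struct timing {
    int cpu_times[HISTORY]; /* circular buffer of recent execution times */
    int cpu_index;          /* next slot to overwrite */
    int samples;            /* number of valid entries, at most HISTORY */
    int cpu_time;           /* current estimate */
};

/* Median of three values without a full sort. */
static int median3(int a, int b, int c)
{
    if (a > b) { int t = a; a = b; b = t; }  /* ensure a <= b */
    if (b > c) { b = c; }                    /* b = min(b, c) */
    return a > b ? a : b;                    /* max(a, min(b, c)) */
}

/* Record one measured execution time and refresh the estimate.
   Assumption: until three samples exist, the latest sample is used. */
void record_time(struct timing *t, int measured)
{
    assert(t != 0 && measured >= 0);
    t->cpu_times[t->cpu_index] = measured;
    t->cpu_index = (t->cpu_index + 1) % HISTORY;
    if (t->samples < HISTORY)
        t->samples++;
    if (t->samples == HISTORY)
        t->cpu_time = median3(t->cpu_times[0], t->cpu_times[1],
                              t->cpu_times[2]);
    else
        t->cpu_time = measured;
}
```

Because the median discards a single outlier, one unusually long run does not inflate the estimate the way a worst-case bound would.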


Implementation  Conclusions 

We  have  implemented  a  real-time  shepherding  simula¬ 
tor  and  have  almost  completed  implementation  of  the 
real-world shepherding application using small vehicles,
a robot arm, and cameras. Ephor has been designed
and implemented and handles a subset of the
items  enumerated  in  the  previous  section.  Results  us¬ 
ing  Ephor’s  mechanisms  [5]  indicate  the  potential  of  an 
intermediate runtime to facilitate the design of SPARTAs.

Designing  the  shepherding  application  led  us  to 
the  conclusion  that  there  is  considerable  functional¬ 
ity  above  the  intended  realm  of  real-time  kernels  that 
an  increasingly  large  class  of  real-time  applications 
strongly  desire.  It  is  best  not  to  remove  the  pre¬ 
dictability  of  the  kernel  since  many  critical  applica¬ 
tions  depend  on  this  property.  However,  there  is  a 
large  class  of  applications  willing  to  relax  the  tight 
constraints  to  gain  increased  functionality.  These  soft 
real-time  applications  will  sacrifice  predictability  with 
or  without  a  runtime.  Further,  if  Ephor’s  mechanisms 
are  not  desired  for  a  portion  of  the  application  they 
may  be  ignored  causing  no  overhead  to  that  portion. 
For  this  reason,  and  others  mentioned  in  [5],  a  runtime 
layer  will  have  a  positive  effect  on  SPARTA  design. 
The only penalty if Ephor is completely ignored (or
used only very minimally) is that the one processor normally
reserved to run it will be unavailable for other
tasks. Ignoring Ephor, though, defeats the purpose of
the layering provided by the runtime. While it is currently
possible to do so, we may discover through more
experimentation  and  feedback  that  it  will  be  best  to 
prevent  this.  In  the  next  section  we  describe  the  re¬ 
sponsibilities  of  each  layer. 

3  Layer  Responsibilities 

By  defining  the  responsibilities  of  each  layer,  we 
clearly  define  the  obligations  of  the  SPARTA  designer. 
More  importantly,  we  can  specify  the  interface  to  the 
runtime,  treating  it  as  a  “black  box”  with  respect  to 
the  SPARTA  programmer.  A  contribution  and  impor¬ 
tant  part  of  our  work  is  a  clean  separation  of  responsi¬ 
bilities  for  each  layer  and  a  description  of  mechanisms 
provided  by  Ephor  independent  of  how  they  are  coded. 
Thus,  we  not  only  have  provided  mechanisms,  but  a 
methodology. 

Figure  1  shows  the  three  layers  in  designing  a 
SPARTA  and  underlying  system.  Information  is 
shared  across  the  runtime-application  boundary  by 
the  data  structure  appearing  in  the  Appendix.  The 
data structure provides flexibility even beyond its primary
purpose. For example, it is two-way writable so
the application can give hints to the runtime by writing
into fields the runtime is normally responsible for
updating. The clean breakdown of information communication
in Fig. 2 has been achieved by dividing
layer responsibilities as detailed below.

Application 

The  application  layer  is  solely  responsible  for  respond¬ 
ing  to  the  environment.  The  application  is  responsible 
for  communicating  to  the  runtime  (see  Appendix)  the 
different  goals  it  will  run  throughout  its  execution, 
the  different  techniques  it  has  for  solving  the  goals, 
and  the  relative  benefit  of  each  technique.  The  appli¬ 
cation  is  responsible  for  determining  the  interaction 
of  the  environment  and  goals  to  produce  the  intended 
behavior.  The  application  is  responsible  for  indicat¬ 
ing  when  a  new  goal  needs  to  be  solved  in  response 
to  an  environmental  stimulus,  or  as  the  result  of  a 
previously  completed  goal.  In  essence  the  application 
must  provide  the  flow  of  control  to  produce  the  desired 
program. 

Runtime 

The  runtime  receives  the  structure  of  the  goals  the  ap¬ 
plication  will  submit  throughout  its  execution  via  the 
data  structure  in  the  Appendix.  As  can  be  observed 
from  the  Appendix,  items  implement  techniques,  and 
techniques  satisfy  goals.  The  runtime  is  responsible 
for  determining  the  execution  time  of  the  different 
techniques  for  solving  the  goals.  The  application  can 
explicitly provide the times, or the runtime may dynamically
gather them during the program’s execution.
The latter method allows Ephor to adapt in a
changing environment. Ephor is responsible for determining
and running the appropriate items needed for
completing a selected technique. In selecting the technique
to solve a requested goal, Ephor needs to be
aware not only of the program structure and application,
but also of the internal state of the system. The runtime
is  therefore  responsible  for  interfacing  to  the  under¬ 
lying  kernel  to  obtain  the  information  it  needs.  It  is 
responsible for monitoring the following resources necessary
to select dynamically the best technique: processor
load, expected available processors for running
parallel  techniques,  memory  or  cache  utilization,  bus 
utilization,  and  other  current  resource  allocation  such 
as  range  sensors,  manipulator,  or  cameras.  Ephor  is 
responsible  for  maintaining  a  central  location  where 
resource  allocation  information  can  be  quickly  and  co¬ 
herently  obtained  by  either  Ephor  or  the  application. 
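
As a rough illustration of selecting a technique from the internal state of the system, the sketch below picks the most beneficial technique whose processor and memory needs fit what is currently free. It is a simplification under stated assumptions: struct tech, struct sys_state, and select_technique are hypothetical names, the priority field is used here as the relative benefit, and the other resources listed above (bus, sensors, manipulator) are ignored.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, trimmed-down view of a technique: only the fields
   this sketch needs. Names loosely follow the Appendix. */
struct tech {
    int   priority;  /* 0 is best; a lower number means higher benefit */
    float cpu_per;   /* fraction of a processor interval needed, 0..1 */
    int   memory;    /* memory required */
};

/* Hypothetical snapshot of the system state that the runtime would
   obtain from the kernel. */
struct sys_state {
    float cpu_free;  /* fraction of processor capacity currently free */
    int   mem_free;  /* memory currently free */
};

/* Return the index of the most beneficial technique that fits the
   free resources, or -1 if none fits. */
int select_technique(const struct tech *techs, size_t n,
                     const struct sys_state *s)
{
    int best = -1;
    assert(techs != NULL && s != NULL);
    for (size_t i = 0; i < n; i++) {
        if (techs[i].cpu_per > s->cpu_free || techs[i].memory > s->mem_free)
            continue;  /* this technique would not fit right now */
        if (best < 0 || techs[i].priority < techs[best].priority)
            best = (int)i;
    }
    return best;
}
```

With two planners, an expensive preferred one and a cheap fallback, the function returns the preferred planner on an idle system and the fallback under load, which is the behavior the text attributes to the runtime.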

To provide a comprehensive runtime package we
have implemented many mechanisms in Ephor. While
it is beyond the scope of this paper to discuss them
in detail individually, we give a list to provide an
overview of Ephor’s functionality: 1) dynamic technique
selection based on internal system state, 2)
scheduling of tasks using a derivative worst case policy, 3)
dynamic parallel process control, 4) de-scheduling of
all running tasks associated with a particular goal,
5) automatic timing of tasks and updating of their status
block, 6) a data structure to share information between
the runtime and application, 7) parallel scheduling of
hard and soft real-time tasks, 8) overdemand detection
and recovery*, 9) early termination of tasks*, 10)
automatic resource allocation based on goal priority*.

Figure 1: The Three Layers in the Design of a Real-World Application

Figure 2: Layer Information Exchange (labels: program structure; flow of control - goal requests; scheduling requests; priority assignments)

A  difficulty  in  soft  real-time  systems  is  evaluating 
the  performance  of  a  given  mechanism  since  there  is  of¬ 
ten  not  a  hard  metric  that  can  be  used  to  judge  success 
or  failure.  Part  of  our  continuing  work  involves  defin¬ 
ing  suitable  metrics  for  measuring  our  mechanisms. 
Here  we  point  to  one  particular  case  of  how  the  dy¬ 
namic  technique  selection  mechanism  of  Ephor  per¬ 
forms  (a  more  complete  analysis  of  this  mechanism  can 
be  found  in  [5]).  Figure  3  represents  the  performance 
of  two  different  planners  (A  and  B)  and  the  dynamic 
selection  of  them  (with  runtime).  The  runtime  results 
were obtained by having Ephor dynamically select between
the two planners based on the internal state of

Figure 3: Performance of dynamic technique selection

Kernel

We  assume  a  hard  real-time  substrate.  The  kernel  is 
expected  to  provide  fundamental  real-time  properties 
and  mechanisms  such  as  a  predictable  scheduling  pol¬ 
icy,  an  accurate  real-time  clock,  guaranteed  deadlines, 
task  priorities,  and  other  properties  typically  associ¬ 
ated  with  real-time  kernels  [4].  The  kernel  is  respon¬ 
sible  for  handling  interrupts,  and  allocating  resources 
as  directed  by  Ephor. 

4  Conclusions 

In this paper we discussed the difficulties in
SPARTA design and argued for a runtime layer to alleviate
some of these problems. The benefits of our
run-time system are two-fold. First, there are software
engineering and programming benefits: application
programmers can specify their needs without
knowing  how  they  will  be  fulfilled;  and  a  common  set 
of  mechanisms  can  be  reused  across  many  domains 
without  recoding.  Second,  by  thoroughly  investigat¬ 
ing  many  possibilities,  we  can  include  in  the  runtime 
those  mechanisms  yielding  the  best  performance. 

In  designing  a  Soft  PArallel  Real-Time  Applica¬ 
tion  (SPARTA)  we  discovered  that  the  functionality 
provided  by  current  hard  real-time  operating  systems 
was  limited.  We  developed  Ephor,  a  runtime  envi¬ 
ronment,  to  increase  the  functionality  of  the  system 
while  maintaining  a  predictable  kernel.  Our  design 
provides a clean separation of responsibilities, facilitating
the design of SPARTAs and improving their performance.

Acknowledgements 

Tom  LeBlanc  provided  helpful  discussions  and  com¬ 
ments  on  this  work. 

Appendix 

Ephor-Application  Interface  (abridged) 


References 

[1] V. P. Holmes and D. L. Harris. A designer’s perspective
of the hawk multiprocessor operating system kernel.
Operating Systems Review, 23(3):158-172, July 1989.

[2] K. Schwan, P. Gopinath, and W. Bo. Chaos-kernel
support for objects in the real-time domain. IEEE
Transactions on Computers, 36(8):904-916, August
1987.

[3] J. Stankovic and K. Ramamritham. The design of the
spring kernel. Proceedings of the Real-Time Systems
Symposium, pages 146-157, December 1987.

[4] John A. Stankovic. Real-time operating systems:
What’s wrong with today’s systems and research issues.
Real-Time Systems Newsletter, 8(1):1-9, 1992.

[5] Robert W. Wisniewski and Christopher M. Brown.
Ephor, a run-time environment for parallel intelligent
applications. In Proceedings of The IEEE Workshop
on Parallel and Distributed Real-Time Systems, pages
51-60, Newport Beach, California, April 13-15, 1993.


struct item_t {
    int imp_kind;  /* indicates if a function or technique implements item */
    union implementer_t {
        funct_t f_implement;              /* pointer to function that implements this item */
        struct technique_t *t_implement;  /* pointer to the technique implementing item */
    } implementer;
    int complete;  /* 0 means nothing, 1 means done, anything else is left to
                      the interpretation of particular item */
};

struct technique_t {
    int priority;         /* 0 -> n the lower the number the higher the priority */
    int cpu_time;         /* the median of the last three times, provides a good
                             dynamic estimate rather than using worst case time */
    int cpu_times[3];     /* used to compute cpu_time see above */
    int cpu_index;        /* indicate which cpu_time index to write to */
    float cpu_per;        /* percentage of an INTERVAL of a cpu this technique needs */
    float max_change;     /* number indicating the maximum percentage change */
    int cutoff_time;      /* when the system should terminate this technique */
    int memory;           /* space required by this technique, instructions and data */
    int *item_par_des;    /* the parallel ordering on items similar numbers may proceed
                             in parallel, low numbers must be run before higher numbers */
    struct item_t *item;  /* set of items that implement this technique */
};

struct goal_t {
    boolean periodic;     /* TRUE indicates task is periodic */
    int rate;             /* if periodic is true the system runs this goal at rate */
    int run_technique;    /* technique that should currently be run to satisfy goal */
    int numb_techniques;  /* number of different possible techniques for this goal */
    struct technique_t technique[MAX_TECHS];  /* list of techniques to solve goal */
};

MAIN struct goal_t goal_list[MAX_GOALS];  /* a list of goals for the application */
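
The item_par_des comment in the interface says that items with the same number may proceed in parallel and that low numbers must run before higher ones. One way to read that is as a sequence of stages. The sketch below walks those stages; it is an illustrative assumption about how such an array could be interpreted, not Ephor's implementation, and next_stage and items_in_stage are hypothetical helpers. It also assumes stage numbers are smaller than INT_MAX.

```c
#include <assert.h>
#include <limits.h>

/* Find the smallest stage number strictly greater than `after`.
   Returns INT_MAX when no later stage exists. */
int next_stage(const int *item_par_des, int n_items, int after)
{
    int next = INT_MAX;
    assert(item_par_des != 0 && n_items >= 0);
    for (int i = 0; i < n_items; i++)
        if (item_par_des[i] > after && item_par_des[i] < next)
            next = item_par_des[i];
    return next;
}

/* Collect into out[] the indices of all items belonging to `stage`.
   These items may be started together. Returns how many were found;
   out[] must have room for n_items entries. */
int items_in_stage(const int *item_par_des, int n_items, int stage, int *out)
{
    int count = 0;
    assert(item_par_des != 0 && out != 0);
    for (int i = 0; i < n_items; i++)
        if (item_par_des[i] == stage)
            out[count++] = i;
    return count;
}
```

A caller would start with next_stage(des, n, INT_MIN), launch every item returned by items_in_stage, wait for all of them to finish, and then ask for the next stage until INT_MAX is returned.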
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Abstract 

The Spring Kernel and associated algorithms, languages,
and tools provide system support for static or
dynamic real-time applications that require predictable
operation. Spring currently consists of two major
parts: (1) the development environment, where application
and target systems are described, preprocessed
and downloaded, and (2) the run-time environment,
where the operating system, the Spring Kernel, creates
and ensures predictable executions of application tasks.
Very recently, we have integrated our real-time systems
technology with component technologies from robotics,
computer vision, and real-time artificial intelligence,
to develop a test platform for flexible manufacturing.
But the results being produced are generic so that they
should be applicable in many other real-time applications such as
air traffic control and chemical plants. In this paper
we describe this platform, identify new features that
we developed, and comment on some lessons learned
to date from this experiment.

1  Introduction 

The Spring Kernel [6] is designed to conform to
the requirements of a wide range of highly predictable
and dynamic real-time applications. The strength of
its run-time support lies in the use of its two distinct
scheduling concepts as well as its support for both predictability
and flexibility. With regard to scheduling, it
supports a once-guaranteed-always-guaranteed policy
suited for applications, or parts of an application, that
impose stringent time and resource requirements, or
where once an operation begins it cannot be undone.
It also provides a best-effort type of scheduling, where
the system dynamically creates a schedule to maximize
the total accrued system value. Subject to system
requirements, these two scheduling concepts can
coexist within one application. The kernel also retains
a significant amount of semantic information at runtime,
supplied by the system designers and programmers
as well as extracted by various tools such as the
compiler. This information is then utilized at runtime,
providing a high degree of flexibility when required.
This flexibility is subject to the scheduling rules and
algorithms so that predictability is retained even as
the system is adapting. We refer to this architectural
strategy as a reflective architecture [7]. While many
papers have been written describing various innovations
in the Spring kernel and its associated scheduling
algorithms [5, 6], in this paper we report on new extensions
and lessons learned when applying the ideas
and system to flexible manufacturing [1, 2]. The extensions
arise both in developing a complete runtime
platform and because of the need for integrating component
technologies from robotics, computer vision,
and real-time AI with this real-time computing platform.
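
The once-guaranteed-always-guaranteed idea can be illustrated with a small admission test: a new task is accepted only if every already guaranteed task, plus the newcomer, still meets its deadline. The sketch below is deliberately simplified and is not the Spring scheduling algorithm, which also handles resources, precedence constraints, and multiprocessing. It assumes a single processor, tasks that are all ready at time now, and earliest-deadline-first ordering; all names are hypothetical.

```c
#include <assert.h>
#include <stdlib.h>

#define MAX_TASKS 16

/* Hypothetical task description: worst-case execution time and an
   absolute deadline, both in ticks. */
struct task { int wcet; int deadline; };

/* The set of tasks that have already been guaranteed. */
struct guaranteed_set { struct task tasks[MAX_TASKS]; int n; };

/* qsort comparator: earliest deadline first. */
static int by_deadline(const void *a, const void *b)
{
    const struct task *x = a, *y = b;
    return (x->deadline > y->deadline) - (x->deadline < y->deadline);
}

/* Try to guarantee `candidate`. Returns 1 and adds it to the set if
   all deadlines are still met; returns 0 and leaves the set unchanged
   otherwise, so earlier guarantees are never revoked. */
int guarantee(struct guaranteed_set *g, struct task candidate, int now)
{
    struct task trial[MAX_TASKS];
    int n, t = now;
    assert(g != NULL && g->n >= 0 && g->n <= MAX_TASKS);
    if (g->n == MAX_TASKS)
        return 0;                         /* no room for another task */
    for (n = 0; n < g->n; n++)
        trial[n] = g->tasks[n];
    trial[n++] = candidate;
    qsort(trial, (size_t)n, sizeof trial[0], by_deadline);
    for (int i = 0; i < n; i++) {
        t += trial[i].wcet;               /* finish time of task i */
        if (t > trial[i].deadline)
            return 0;                     /* some deadline would be missed */
    }
    g->tasks[g->n++] = candidate;
    return 1;
}
```

A rejected candidate can then be handed to a best-effort policy or reported back to the requester, while tasks already in the set keep their guarantee.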

2  The  Flexible  Manufacturing 
Testbed 

The  basic  idea  underlying  the  testbed  is  to  model 
applications  which  provide  predictable  responses  while 
functioning  in  nondeterministic  environments.  In 
this  testbed  (see  Figure  1),  objects  or  material 
needed  by  consuming  processes  are  introduced  non- 
deterministically  into  a  circular  queue  awaiting  pro¬ 
cessing.  A  consuming  process  can  request  (a  specific 
combination  of)  objects  at  any  time,  and  it  can  set  ar¬ 
bitrary  deadlines  and  values  for  the  objects  needed  by 
it.  The  circular  queue  can  control  the  type  of  incoming 
material  and  the  rate  at  which  it  is  processed. 

To maximize total accrued system value (in terms
of objects delivered in time to consuming processes)
and resource utilization, Spring’s dynamic scheduling
algorithm is used to manage the resources in the system.
Specifically, based on resource availability, specific
requests of consuming processes, as well as specific
timing constraints, objects are removed from the
circular queue and placed on a delivery platform in
a timely manner. The delivery platform represents a
user specified, finite capacity intermediate storage. All
objects on this platform are guaranteed to meet their
timing constraints and to be delivered to the downstream
processes as requested and in time.

Figure 1: Schematic of the Flexible Manufacturing Testbed

The  entire  specification  of  such  a  system,  from  ap¬ 
plication  processes  to  target  hardware,  is  described 
using  a  System  Description  Language  (SDL)  [4].  On 
the other hand, all system dynamics are handled by
the  Spring  Kernel. 

This  testbed  has  characteristics  applicable  to  a  wide 
variety  of  actual,  real  world  applications  including: 

1. Air Traffic Control — airplanes can enter a holding
pattern, they can leave if fuel is low or if they
cannot be guaranteed timely landing, required
terminal, and service crew. If they are admitted
to the runway they must be guaranteed full
service.

2. Chemical Plants — different chemicals ready for
mixing are placed on the conveyer belt. The consuming
processes require chemicals to be mixed
in a certain order. Due to chemical deterioration,
each chemical solution must be either delivered
within a certain time, or it must be disposed of to
prevent hazards.

3.  Flexible  Manufacturing  —  robotic  workcells,  de¬ 
picted  in  the  figure  above  as  the  consuming  pro¬ 
cess,  can  perform  incremental  operations  such  as 
mechanical  assembly  or  quality  control  inspec¬ 
tion,  and  deliver  finished  components  to  a  sub¬ 
sequent  manufacturing  operation. 


3  Extensions  and  Lessons 
Learned 

Our  prior  work  on  scheduling  and  operating  system 
support  for  real-time  systems  [5,  6]  as  well  as  recent 
work on software description languages and tools [4]
was  not  targeted  at  any  particular  application.  Our 
goal  was  to  design  them  to  be  applicable  to  time- 
critical  applications  that  had  a  broad  set  of  charac¬ 
teristics. 

In this section we discuss how some of these concepts
and their practical realizations in Spring had to
be  extended  to  apply  them  to  the  flexible  manufactur¬ 
ing  platform. 

3.1  Scheduling  Extensions 

For  flexible  manufacturing,  it  was  necessary  to  sup¬ 
port  tasks  with  precedence  constraints,  shared  re¬ 
sources  and  multiprocessing.  Spring  scheduling  algo¬ 
rithms  had  been  designed  with  such  requirements  in 
mind  [5]. 

However,  additional  requirements  were  imposed  by 
the  fact  that  scheduling  must  occur  at  different  levels 
of  abstraction.  At  the  higher  level  the  system  must 
deal  with  orders  for  products,  resources  which  con¬ 
sist  of  parts  and  subcomponents  to  be  automatically 
assembled  -  constrained  by  robots,  floor  space,  cost, 
and  expected  profits.  Decisions  made  at  this  level, 
handled  by  the  real-time  AI  subsystem,  determined 
which  of  the  incoming  orders  need  to  be  carried  out, 
computational  resources  permitting.  Whether  it  was 
possible  to  carry  out  a  specific  order  was  determined 
by  the  scheduler  at  the  lower  level,  where  the  sys¬ 
tem  deals  with  the  computational  resources  needed  to 
move robots, assemble products, etc. It was necessary
to implement both levels of scheduling with a
feedback interface between these levels. For example,
if the higher level decides to make certain products,
the actual manufacturing floor may not be capable
of performing these tasks in time. Such information
supplied to the higher level scheduler improves performance
of the system in choosing between alternatives.
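
The feedback between the two levels can be pictured as a simple loop: the higher level proposes the most valuable pending order, the lower level answers whether the computational resources permit it, and the higher level uses that answer to move on to alternatives. The sketch below is only a toy model under stated assumptions: the lower level is reduced to a budget of computation ticks, and struct order, struct floor_state, low_level_admit, and plan_orders are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

enum status { PENDING, ACCEPTED, REJECTED };

/* Hypothetical order as seen by the higher level. */
struct order { int value; int cpu_ticks; enum status status; };

/* Stand-in for the lower level: computation ticks still available. */
struct floor_state { int ticks_free; };

/* Lower level: returns 1 and reserves capacity if the order fits,
   otherwise returns 0. This return value is the feedback. */
int low_level_admit(struct floor_state *f, const struct order *o)
{
    if (o->cpu_ticks > f->ticks_free)
        return 0;
    f->ticks_free -= o->cpu_ticks;
    return 1;
}

/* Higher level: offer orders in descending value and record the
   feedback for each. Returns the total value of accepted orders. */
int plan_orders(struct order *orders, size_t n, struct floor_state *f)
{
    int total = 0;
    assert(orders != NULL && f != NULL);
    for (size_t i = 0; i < n; i++)
        orders[i].status = PENDING;
    for (size_t round = 0; round < n; round++) {
        size_t best = n;
        for (size_t i = 0; i < n; i++)
            if (orders[i].status == PENDING &&
                (best == n || orders[i].value > orders[best].value))
                best = i;
        /* One pending order is resolved per round, so best < n here. */
        if (low_level_admit(f, &orders[best])) {
            orders[best].status = ACCEPTED;
            total += orders[best].value;
        } else {
            orders[best].status = REJECTED;
        }
    }
    return total;
}
```

In this toy model a rejected high-value order leaves room for a cheaper alternative, which is one small instance of the lower level's feedback improving the higher level's choice between alternatives.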

Interesting research questions regarding multiple
levels of scheduling include:

•  how  to  partition  the  scheduling  functionality  be¬ 
tween  the  high  level  (RTAI)  planner  and  the  sys¬ 
tem  level  scheduler  (which  is  also  a  planner), 

•  what  information  should  be  passed  back  and 
forth  between  the  levels,  and 

• how to pass information back and forth between
the schedulers so as to get the best performance.

In  developing  the  flexible  manufacturing  testbed  we 
also  found  it  necessary  to  be  able  to  hold  resources 
across  a  set  of  processes.  For  example,  a  set  of  pro¬ 
cesses  may  require  a  common  tool  which  cannot  be 
shared  with  others.  This  adds  an  interesting  schedul¬ 
ing  complication  not  typically  addressed  by  real-time 
scheduling  algorithms.  We  also  found  that  deadlines 
can  sometimes  be  relaxed  within  a  deadline  tolerance 
[3].  Interesting  questions  involve  understanding  the 
cumulative  effect  of  missing  the  original  deadline,  but 
satisfying  the  tolerance  factor. 
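
To make the notion of a deadline tolerance concrete, the sketch below classifies a completion as on time, late but within tolerance, or missed, and accumulates lateness so the cumulative effect can be examined. The three-way classification, the struct tolerance_stats type, and the function names are assumptions made for illustration; the source does not specify how the tolerance is represented.

```c
#include <assert.h>

enum outcome { ON_TIME, WITHIN_TOLERANCE, MISSED };

/* Classify one completion. All times are in the same unit (ticks). */
enum outcome classify(int finish, int deadline, int tolerance)
{
    assert(tolerance >= 0);
    if (finish <= deadline)
        return ON_TIME;
    if (finish <= deadline + tolerance)
        return WITHIN_TOLERANCE;
    return MISSED;
}

/* Hypothetical running totals for studying cumulative effects. */
struct tolerance_stats {
    int on_time;
    int tolerated;
    int missed;
    int total_lateness;  /* sum of (finish - deadline) over late completions */
};

/* Fold one completion into the running totals. */
void account(struct tolerance_stats *s, int finish, int deadline, int tolerance)
{
    assert(s != 0);
    switch (classify(finish, deadline, tolerance)) {
    case ON_TIME:
        s->on_time++;
        break;
    case WITHIN_TOLERANCE:
        s->tolerated++;
        s->total_lateness += finish - deadline;
        break;
    case MISSED:
        s->missed++;
        s->total_lateness += finish - deadline;
        break;
    }
}
```

Tracking total_lateness separately from the counts lets an experiment distinguish many small tolerated overruns from a few large ones.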

3.2  Extensions  to  the  Kernel  to  Interface 
with  External  Subsystems 

As  originally  implemented,  the  Spring  kernel  con¬ 
trolled  computational  activities  running  on  processors 
within  the  multiprocessor  Spring  node(s).  However,  in 
the  flexible  manufacturing  platform  there  was  a  need 
to  control,  in  a  timely  fashion,  not  only  these  activ¬ 
ities,  but  also  robotic  and  vision  processes  running 
outside  the  Spring  node(s).  This  is  because,  given 
that  commercial  robotics  and  vision  subsystems  are 
highly  developed  and  efficient,  it  is  necessary  to  be 
able to link such subsystems into a flexible manufacturing
plant without re-implementation. To do this,
we  defined  and  implemented  a  remote  interface  so  that 
such  subsystems  can  be  added  to  the  Spring  runtime 
platform, as long as they adhere to the interface requirements.
The interface allows the Spring scheduler
to control the timing properties of the external subsystems.
The Spring, vision, and robotic subsystems
interact  via  distributed  shared  memory,  implemented 
using  ScramNet[8]. 

3.3  System  Description  Language 

The system description language (SDL) had been
designed to provide a way for developers to specify
the properties of all parts of the system in great detail,
independent of the application functionality [4].
To provide easier system integration and to assist in
on-line scheduling, a method of compiling or mapping
program representations to task representations was
also developed. The compiled information is available
for use by other tools as well as the runtime system
that use, modify, or add to it. In short, SDL provides
support for specification, compilation, and execution
of applications on the Spring system. It is also used
for the specification of simulations run under Spring’s
scheduling testbed. In this way, both the actual system
and the simulations complement each other. This
is especially important in robotic applications where
testing must be done with simulated robots for safety
reasons, before you control the actual robots.

While  the  SDL  provides  software  support  for  spec¬ 
ifying  important  information,  extensions  were  neces¬ 
sary  when  used  with  the  flexible  manufacturing  plat¬ 
form: 

• techniques to describe the interfaces with the
robotic and vision subsystems and how they
are expected to operate (including the expected
workloads),

•  specification  of  fault  tolerance  requirements  and 
design,  and 

• linking computer aided design (CAD) and design
for manufacture (DFM) tools to SDL.

Another key issue is the need for tools to aid the
system builder in deriving deadlines from time, value
and fault properties of the application and environment.
In flexible manufacturing there is a combination
of loose time constraints and very precise time constraints
which are linked to each other. For example,
the arrival of parts, orders for products, and delivery
dates and times usually have loose and unpredictable
timing requirements. Yet, sometimes delivery times
are very strict and missing the deadline can cause serious
financial loss. Further, once a product is begun
there are often tight and strict deadlines to be met in
the form of commands to the robots, else the product
is ruined, or worse, the robot itself is damaged.

3.4  Integrated  Simulation  Support 

Simulation is critical to flexible manufacturing. We
are developing a simulator that is integrated in three
ways: across multiple levels of scheduling, with the
runtime system model, and with actual robotic workcells.

In order to understand the operation of the system
prior to putting it into use, we have developed
a simulator that is driven from an application level
view. This consists of being able to order products,
have raw materials arrive, and see the flexible manufacturing
floor in operation, i.e., visualize the robots
assembling products. Then it is possible to zoom into
the system level and see what computational entities
are actually running to effect the application level actions.
The schedules are dynamically changed as the
higher level actions proceed, precedence constraints
can be displayed, and summary statistics continuously
updated. These features provide the integration across
levels of scheduling.

The  simulator  also  accurately  follows  the  runtime 
model  of  the  flexible  manufacturing  testbed  system  so 
that  the  same  workloads,  algorithms,  etc.  are  operat¬ 
ing  in  both  systems.  The  exact  same  tests  can  drive 
both  systems  and  the  outputs  can  be  directly  com¬ 
pared. 

We  also  have  plans  to  integrate  n  copies  of  the  sim¬ 
ulator  running  on  workstations  with  the  actual  robot 
workcell  so  that  a  distributed  factory  floor  can  be 
modeled.  This  will  permit  experimentation  with  dis¬ 
tributed  systems  and  coordination  algorithms. 

4  Summary

Attempting  to  use  the  Spring  algorithms  and  ker¬ 
nel  to  implement  the  flexible  manufacturing  platform 
has  resulted  in  significant  improvements  in  the  system 
and has identified a number of new research issues,
some of which have been briefly discussed here. As of
this writing, the physical testbed has been completely
implemented, and the kernel, SDL, and simulator are all
functioning, with two exceptions: (i) part of the interface
between the Spring scheduling algorithm and the
real-time AI planner remains to be fully debugged, and
(ii) minor modifications required by the scheduling algorithm
must be made. These should be completed
shortly.

1  Introduction

Serializability greatly simplifies reasoning about
correctness in concurrent systems, including real-time
systems. Our research addresses concurrency control
protocols that accommodate analytic guarantees
of schedulability, can be implemented with small
bounded overheads and blocking, and ensure serializable
execution of entire tasks including complete
read/compute/write cycles (as opposed to serializable
execution only of short embedded transactions without
computation).

One such protocol, which combines locking and
abort, is described below. Among its interesting properties
are that transactions scheduled by locking are
never aborted, tasks are aborted only due to conflict
with higher priority tasks, and the cost of abortion can
be bounded for the purpose of schedulability analysis.¹
The protocol is illustrated with an avionics example
adapted from [4]. The priority ceiling protocol can ensure
schedulability of 8 tasks if serializability of only
short sequences of data accesses is required, but cannot
schedule even the first 2 tasks if serializability is
required for whole tasks. Under reasonable assumptions
our protocol achieves schedulability of the first 6
tasks while guaranteeing serializability of entire tasks.

2  Model  and  assumptions 

We consider schedulability for a set of tasks,
$\{\tau_1, \tau_2, \ldots, \tau_n\}$. We consider each task also as a transaction
which may contain read and write operations
to shared data and is a unit of consistency, i.e., if executed
alone, it transforms the shared data from a
consistent state to another consistent state. $T_i$ is the
transaction associated with task $\tau_i$, $1 \le i \le n$. $r_i[x]$
and $w_i[x]$ denote $T_i$'s read and write operation on data
item x, respectively. Two operations are said to conflict
if they both operate on the same data item and
at least one of them is a write.

¹Contrast to the abort-oriented protocol presented in [5],
which completely avoids priority inversion but does not bound
abortion cost.

$<_i$ denotes the execution order of operations in each
transaction $T_i$. A schedule is a partial order $(H, <_H)$
such that (1) $H = \bigcup_{i=1}^{n} T_i$; (2) $<_H \supseteq \bigcup_{i=1}^{n} <_i$; and
(3) for any two conflicting operations $p, q \in H$, either
$p <_H q$ or $q <_H p$.

We assume static priority assignment based on rate
monotonic analysis (RMA) [3]. $P_i$ denotes the period
of $\tau_i$. To control priority inversion, we use a variant of
the stack resource policy (SRP) [1], treating reads
and writes differently as in the read/write priority ceiling
protocol (PCP) [6]; we will call this R/W SRP. It
is interesting to note that the stack-based discipline
is critical to the correctness of the mixed protocol described
below, as well as its performance.

3  Mixed  locking/abort  protocol 

SRP  and  read/write  PCP  are  pure  locking  proto¬ 
cols  with  the  characteristic  that  worst-case  blocking  is 
limited  to  the  sizes  of  embedded  transactions  or  crit¬ 
ical  sections.  In  [6],  PCP  is  termed  a  2-phase  lock¬ 
ing  protocol,  but  embedded  transactions  are  presumed 
to  contain  only  database  operations.  This  limitation 
makes  controlling  priority  inversion  easier,  but  rea¬ 
soning  about  overall  system  correctness  harder.  Un¬ 
fortunately,  a  2PL-scheduled  transaction  consisting  of 
a  read-compute-write  sequence  must  not  begin  to  re¬ 
lease  locks  until  the  last  lock  has  been  obtained;  crit¬ 
ical  sections  may  therefore  be  nested  and  may  extend 
across  the  middle  compute  steps. 

The protocol described below uses R/W SRP as
a baseline strategy, but schedules a transaction T
by abort when schedulability analysis reveals that T
causes excessive blocking. T immediately releases each
read lock after its read access, hence incurring less
blocking to higher-priority transactions. Since the
protocol is not 2-phase, a combination of locking and
timestamps is used to order conflicting operations.

We assume a source of monotonically increasing
timestamps. Each data item x is associated with two
timestamps: Rcur[x] from the most recent read operation
and Wmax[x] from the most recent write operation.

•  In transactions scheduled by pure locking, locks are
requested as in 2PL. When a transaction $T_j$ commits,
it obtains its timestamp $ts(T_j)$ and for each datum
x read (written) by $T_j$, sets Rcur[x] (Wmax[x])
equal to $ts(T_j)$ and then releases its lock on x.

•  In transactions scheduled by locking and abort, $T_j$
obtains its timestamp $ts(T_j)$ at initiation.

For each item x accessed in the read phase, $T_j$
first sets a read lock on x. It then checks whether
Wmax[x] > $ts(T_j)$ and aborts if so; otherwise it
reads x. Rcur[x] is set to $ts(T_j)$, and the read lock
on x is released.

For each item x accessed in the write phase, $T_j$ sets
a write lock on x. It then checks whether Rcur[x] >
$ts(T_j)$, and if so it aborts. If Wmax[x] > $ts(T_j)$, $T_j$
releases the write lock without writing (an application
of Thomas' write rule [7]). Otherwise, it places
x on wList and delays the actual write operation
until the commit phase.

In the commit phase, each x on wList is written,
Wmax[x] is set to $ts(T_j)$, and then the write lock
on x is released.
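
The rules for a transaction scheduled by locking and abort can be sketched in a few lines of Python. This is only an illustrative, single-threaded sketch: the names `Item`, `AbortTxn`, and `Abort` are hypothetical, timestamps are plain integers supplied by the caller, and the read/write locks (which the protocol implements through priority ceilings) are not modeled and appear only as comments.

```python
from dataclasses import dataclass, field


class Abort(Exception):
    """Signals that an abort-scheduled transaction must be restarted."""


@dataclass
class Item:
    value: object = None
    rcur: int = 0  # Rcur[x]: timestamp of the most recent reader
    wmax: int = 0  # Wmax[x]: largest timestamp among writers so far


@dataclass
class AbortTxn:
    db: dict   # maps item name -> Item
    ts: int    # timestamp obtained at initiation
    wlist: dict = field(default_factory=dict)  # delayed writes (wList)

    def read(self, x):
        item = self.db[x]            # (read lock on x would be set here)
        if item.wmax > self.ts:      # a later writer has already written x
            raise Abort(x)
        value = item.value
        item.rcur = self.ts          # may overwrite a larger Rcur, as the text allows
        return value                 # (read lock released here)

    def write(self, x, value):
        item = self.db[x]            # (write lock set here, held until commit)
        if item.rcur > self.ts:      # a later reader has seen the old value
            raise Abort(x)
        if item.wmax > self.ts:      # Thomas' write rule: drop the obsolete write
            return
        self.wlist[x] = value        # defer the actual write to the commit phase

    def commit(self):
        for x, value in self.wlist.items():
            item = self.db[x]
            item.value = value
            item.wmax = self.ts      # (write lock released after this update)
```

Because writes are buffered in `wlist` until `commit`, aborting a transaction in this sketch requires no undo of data values, which matches the delayed-write rule in the text.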

As in read/write PCP, locks in our protocol are set
and released by changing the r/w priority ceilings of
the corresponding data items. The protocol is not two
phase since read locks of each transaction scheduled by
abort/locking are released immediately after access.
To guarantee serializability, timestamps are utilized,
in addition to locks, to order conflicting operations.

Observe that Wmax[x] holds the maximum timestamp
of transactions that updated Wmax[x] (it increases
monotonically) but Rcur[x] holds the timestamp
of the transaction that most recently updated
Rcur[x]. Permitting Rcur[x] to be updated in the
"wrong" order complicates the correctness proof, but
eliminates a costly critical section in an implementation.

4  Properties  of  the  mixed  protocol 

We denote operations on x by $o[x]$, $p[x]$ and $q[x]$
where it is not necessary to distinguish between read
and write. $ts_k$ denotes acquisition of a timestamp by
transaction $T_k$. $U_k^o[x]$ denotes $T_k$'s updating of Rcur[x]
if $o = r$ or Wmax[x] if $o = w$. We assume either
$U_k^o[x] <_H U_l^o[x]$ or $U_l^o[x] <_H U_k^o[x]$ for any two transactions
$T_k$ and $T_l$ and $o = r$ or $w$.

Observation 1 For every transaction $T_k$ and an operation
$o_k[x]$ of $T_k$, $ts_k <_k U_k^o[x]$ and $o_k[x] <_k U_k^o[x]$.
Hence, $ts_k <_H U_k^o[x]$ and $o_k[x] <_H U_k^o[x]$ for every
schedule $H$.


This  follows  from  the  rules  that  a  transaction  ac¬ 
quires  its  timestamp  before  it  updates  either  Rcur  or 
Wmax,  and  it  performs  each  of  its  operations  before 
it  updates  either  Rcur  or  Wmax. 

Definition 1 Given any operation $q_j[x]$, we call $p_i[x]$
the immediate preceding conflicting operation of $q_j[x]$
if $p_i[x]$ executes before and conflicts with $q_j[x]$, and
$\nexists\, o_k[x]$ such that $o_k[x]$ also conflicts with $q_j[x]$ and
$U_i^p[x] <_H U_k^o[x] <_H U_j^q[x]$.

Note that $o_k[x]$ may or may not conflict with $p_i[x]$.

Observation 2 If $o_k[x]$ conflicts with $p_i[x]$, then
$U_k^o[x] <_H U_i^p[x]$ iff $o_k[x] <_H p_i[x]$.

Informally, timestamp order is consistent with the
order of operations. This follows from the rules that
each conflicting lock is not released until Rcur or
Wmax has been updated. By Definition 1 and Observations
1 and 2, we have $U_i^p[x] <_H q_j[x] <_H U_j^q[x]$.
The next observation follows directly from Definition 1.

Observation 3 If $q = w$ (a write operation), then
both Rcur[x] and Wmax[x] will remain unchanged after
$U_i^p[x]$ is executed and before $q_j[x]$ is executed. If $q = r$,
then Wmax[x] will remain unchanged after $U_i^p[x]$ is
executed and before $q_j[x]$ is executed.

Property  1  The  mixed  locking/abort  protocol  guar¬ 
antees  serializability. 

Proof: Given any operation $q_j[x]$, let $p_i[x]$ be its immediate
preceding conflicting operation. We first show if
$q_j[x]$ was not rejected or ignored, then $ts(T_i) < ts(T_j)$.
There are four cases to consider:

Case I1 Both $T_i$ and $T_j$ are scheduled by pure locking:
By the pure locking scheduling rule, the lock
on x is not released until timestamp is acquired and
properly attached. Since $p_i[x]$ precedes and conflicts
with $q_j[x]$, $T_i$ must obtain its timestamp before $T_j$
did. Hence, $ts(T_i) < ts(T_j)$.

Case I2 $T_i$ is scheduled by abort/locking while $T_j$ is
scheduled by pure locking: $T_i$ obtains its timestamp
when it starts. Also, a lock is set on x before $p_i[x]$ is
performed. Since $T_j$ obtains its timestamp when it
commits, $ts(T_i) < ts(T_j)$.

Case I3 $T_i$ is scheduled by pure locking while $T_j$
is scheduled by abort/locking: Since both $T_i$ and
$T_j$ set a lock on x and $p_i[x]$ precedes $q_j[x]$, by the
time $q_j[x]$ is about to be performed $p_i[x]$ and $U_i^p[x]$
must have been done. Since $p_i[x]$ is the immediate
preceding conflicting operation of $q_j[x]$, by Observation 3,
no conflicting operations will update Rcur[x]
or Wmax[x] or both after $U_i^p[x]$ is executed and before
$q_j[x]$ is executed. If $ts(T_i) > ts(T_j)$, then $T_j$ will
be aborted unless $p = q = w$ (in that case $q_j[x]$ will
be ignored by Thomas' write rule).

Case I4 Abort/locking used for both $T_i$ and $T_j$: similar
to Case I3.

Now suppose $ts(T_i) < ts(T_j)$. We show $ts(T_k) <
ts(T_j)$ for any other preceding conflicting operation
$o_k[x]$ of $q_j[x]$. Suppose to the contrary that $ts(T_k) >
ts(T_j)$. We have $ts_i <_H ts_j <_H ts_k$ because $ts(T_i) <
ts(T_j) < ts(T_k)$. There are two cases to consider:

Case O1 $o_k[x]$ and $p_i[x]$ conflict: $o_k[x]$, $p_i[x]$, and
$q_j[x]$ are pairwise conflicting operations. By our assumptions,
$o_k[x] <_H q_j[x]$ and $p_i[x] <_H q_j[x]$. If
$p_i[x] <_H o_k[x]$, then $p_i[x] <_H o_k[x] <_H q_j[x]$. By
Observation 2, $U_i^p[x] <_H U_k^o[x] <_H U_j^q[x]$, which violates
Definition 1 for immediate preceding conflicting
operations. Hence, we must have $o_k[x] <_H p_i[x] <_H
q_j[x]$ and $U_k^o[x] <_H U_i^p[x] <_H U_j^q[x]$. By Observation 1,
$ts_k <_H U_k^o[x]$. Hence, the schedule must contain
$ts_i <_H ts_j <_H ts_k <_H U_k^o[x] <_H U_i^p[x] <_H
U_j^q[x]$. Because operations of $T_i$ and $T_j$ are not properly
nested, this execution violates the stack-based
discipline.

Case O2 $o = p = r$ and $q = w$: Both $T_k$ and $T_i$
must modify Rcur[x], i.e., $U_k^r[x]$ and $U_i^r[x]$. Since
both $r_k[x]$ and $r_i[x]$ precede and conflict with $w_j[x]$,
by Observation 2, $U_k^r[x] <_H U_j^w[x]$ and $U_i^r[x] <_H
U_j^w[x]$. By our assumption, either $U_k^r[x] <_H U_i^r[x]$ or
$U_i^r[x] <_H U_k^r[x]$. If $U_i^r[x] <_H U_k^r[x]$, then we have
$U_i^r[x] <_H U_k^r[x] <_H U_j^w[x]$, which violates our assumption
that $r_i[x]$ is the immediate preceding conflicting
operation of $w_j[x]$. Hence, $U_k^r[x] <_H U_i^r[x]$.
The schedule thus must contain $ts_i <_H ts_j <_H
ts_k <_H U_k^r[x] <_H U_i^r[x] <_H U_j^w[x]$, which again violates
the stack-based discipline. □

Property 2 Transactions scheduled by pure locking
will never be aborted.

Proof: Follows from Cases I1 and I2 in the proof of
Property 1. □

When a transaction is aborted, we must remove
its effect on the database as well as on other transactions.
If the protocol were not designed carefully, this
might trigger further abortions, often termed cascading
abort.

Property  3  Executions  produced  by  the  mixed  proto¬ 
col  are  cascadeless. 


Proof: Follows from the scheduling rules that write
locks are held until after a transaction commits or
aborts. □

Property  4  If  a  transaction  is  aborted,  then  it  is 
aborted  by  a  higher-priority  transaction. 

Proof: Suppose $T_j$ must be aborted because its operation
$q_j[x]$ cannot be performed because a conflicting
operation $p_i[x]$ of $T_i$ has been executed and $ts(T_i) >
ts(T_j)$. We want to show that $T_i$'s priority must be
greater than $T_j$'s. Since $ts(T_i) > ts(T_j)$, $ts_j <_H ts_i$.
By Observations 1 and 2, the schedule thus contains
$ts_j <_H ts_i <_H U_i^p[x] <_H q_j[x]$. If $T_j$ has higher
priority than $T_i$, then $T_j$ is blocked by lower-priority
transaction $T_i$ after $T_j$ starts, which violates the stack-based
discipline. Hence, $T_i$'s priority must be higher
than $T_j$'s. □

Observation 4 When transactions $T_i$ and $T_j$ contain
two conflicting operations $p_i[x]$ and $q_j[x]$, respectively,
and $T_i$'s priority is higher than $T_j$'s, then the worst-case
abortion cost for $T_j$ by $T_i$ due to conflict on x is
the longest execution time from $T_j$'s initiation up to
(but not including) $q_j[x]$.

This can be observed in the proof of Property 4: if
$q_j[x]$ has been executed before $T_i$ preempts $T_j$, then
$T_j$ will not be aborted by $T_i$ due to conflict on x. We
call such cost the prefix abortion cost for $T_j$. If $T_i$ and
$T_j$ conflict on more than one data item, we need only
consider the largest such cost. We call such cost
A-cost$_{ij}$.

Notice  when  a  transaction  is  aborted,  due  to  delayed 
writes  there  is  no  need  to  undo  its  write  operations.  In 
addition,  there  is  no  need  to  undo  its  read  operations, 
i.e.,  to  recover  the  old  value  of  Rcur  for  each  data  item 
read  by  the  transaction.  Those  reads  can  be  thought 
of  as  “ghost”  operations  and  cause  no  harm  to  over¬ 
all  system  correctness.  As  a  result,  the  overheads  in 
aborting  a  transaction  are  very  small. 

Lemma 1 The worst-case abortion cost charged to
task $\tau_j$ by conflicting task $\tau_i$ is $\lceil P_j / P_i \rceil \times$ A-cost$_{ij}$.

Proof: Because of stack-based discipline, each instance
of $\tau_i$ can abort $\tau_j$ at most once. The Lemma thus
follows from Observation 4. □

We define $HPC_j = \{\tau_i : \tau_i$ has higher priority than
$\tau_j$ and $\tau_i$ conflicts with $\tau_j\}$.

Lemma 2 The worst-case abortion cost for $\tau_j$ is
$\sum_{\tau_i \in HPC_j} (\lceil P_j / P_i \rceil \times$ A-cost$_{ij})$.

Proof: Follows from Lemma 1 and Property 4. □
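
The bound in Lemma 2 is a simple sum and can be sketched as follows. The sketch assumes that each conflicting higher-priority task contributes one abort per release within a period of the aborted task, i.e. a factor of ceil(P_j / P_i); this factor is a reading of the lemma rather than something the text spells out, and the function name and the numbers in the usage lines are hypothetical.

```python
import math


def worst_case_abortion_cost(period_j, hpc_j):
    """Sum ceil(P_j / P_i) * A-cost_ij over the tasks in HPC_j.

    period_j: period P_j of the task that may be aborted.
    hpc_j:    iterable of (P_i, a_cost_ij) pairs, one per higher-priority
              task that conflicts with task j (hypothetical input format).
    """
    return sum(math.ceil(period_j / p_i) * a_cost for p_i, a_cost in hpc_j)


# Hypothetical task with period 100 ms and two conflicting
# higher-priority tasks (periods 25 ms and 40 ms).
cost = worst_case_abortion_cost(100, [(25, 1.0), (40, 0.5)])
```

In the analysis, a cost computed this way is added to the computation time of the aborted task before the schedulability test is repeated.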




5  Application and schedulability analysis

When performing schedulability analysis, we first
assume that all tasks are scheduled by pure locking.
If all critical tasks are schedulable, there is no
reason to consider scheduling tasks by abort/locking.
Otherwise, we add timestamp management overheads
and again calculate worst-case execution times and
blocking duration. The schedulability analysis will
then determine which tasks must be scheduled by
abort/locking.

Whenever a task $\tau_i$ is found not schedulable due
to excessive blocking caused by a lower priority task
$\tau_j$, we schedule $\tau_j$ by abort. We calculate worst-case
abortion cost for $\tau_j$ based on Lemma 2. This abortion
cost effectively becomes part of the computation
time for $\tau_j$. We then re-evaluate worst-case blocking
to $\tau_i$ and resume the schedulability analysis. Because
the new blocking caused by $\tau_j$ cannot be greater than
before, schedulability results for tasks with higher priority
than $\tau_i$ will not be affected.

An example

The avionics platform example in [4] has 18 periodic
tasks and 9 data objects. As in [4], task
Weapon_Release is ordered second (not in pure rate-monotonic
order) to meet a 5 ms jitter requirement.
We assume:²

•  timestamp acquisition takes at most 50 μs.

•  each update of Rcur or Wmax takes at most 0.5 μs.

•  each read/write of a datum takes at most 0.5 μs.

•  reading Rcur or Wmax and comparing it to a transaction's
own timestamp takes at most 1 μs.

•  setting or releasing a lock takes at most 50 μs.

•  for each datum, accessing wList takes at most 50 μs.

Table 1 shows task set characteristics from [4] but
with blocking calculated from execution times of whole
tasks rather than short critical sections. The schedulability
analysis is based on the "critical time test"
reported in [4]. Task Weapon_Release is not schedulable
in a pure locking protocol due to its stringent
jitter requirement and excessive blocking by lower-priority
tasks, even though cumulative task utilization
is only 6.15%. To schedule Weapon_Release and meet
its jitter requirement, we must permit abortion of each
lower-priority task that reads "DB" and whose execution
time is greater than 1 ms. (If the jitter requirement
is removed as in [2], then all the tasks will be schedulable
by pure locking.) All other tasks can be scheduled
by pure locking.

² These assumptions are based on simple memory accesses for
most operations, with no context switches or operating system
services, and the protocol is designed to make such an implementation
possible.

Table  2  illustrates  the  schedulability  calculations 
under  the  new  protocol  with  all  overheads  for  the  first 
six  tasks. 
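
The arithmetic pattern of the tests in Table 2 can be reproduced with a short sketch. It only mirrors the form visible in the table (interference from each higher-priority task over the deadline, plus the task's own cost including abortion, plus blocking) and is not the full critical time test of [4]; the function name is hypothetical, and the numbers are those of the RWR_Contact_Mgmt row as read from the table.

```python
import math


def worst_case_demand(deadline, higher_priority, own_cost, blocking):
    """Demand over `deadline`, following the form of the Table 2 tests.

    higher_priority: (period, cost) pairs for higher-priority tasks,
                     where cost already includes overheads and abortion.
    own_cost:        the task's execution time plus its abortion cost.
    blocking:        worst-case blocking by lower-priority tasks.
    """
    interference = sum(math.ceil(deadline / p) * c for p, c in higher_priority)
    return interference + own_cost + blocking


# RWR_Contact_Mgmt: period and deadline 25 ms, execution 5.0935 ms plus
# abortion cost 5.578 ms, blocking 3.052 ms.
demand = worst_case_demand(
    deadline=25,
    higher_priority=[(1, 0.051), (200, 3.0505), (25, 2.3435)],
    own_cost=5.0935 + 5.578,
    blocking=3.052,
)
schedulable = demand < 25  # demand is roughly 20.39 ms
```

The same function applies to the other rows by substituting their periods and costs.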

6  Conclusion 

The  non-2-phase  protocol  described  above  can  im¬ 
prove  system  schedulability  while  maintaining  a  strong 
correctness  criterion,  i.e.,  serializability  (of  whole  tasks 
and  not  just  of  small  sequences  of  data  accesses).  It  is 
an  example  of  a  concurrency  control  protocol  suitable 
for  hard-real-time  systems,  and  is  (to  our  knowledge) 
the  first  example  of  a  protocol  that  may  abort  tasks 
and  yet  achieve  better  worst  case  schedulability  than 
a  pure  locking  protocol.  Additional  overheads  are  in¬ 
curred  for  timestamp  management  and  abortion,  but 
these  overheads  are  small  and  need  not  be  incurred  by 
all  tasks.  The  choice  of  which  tasks  require  timestamp 
management  and  possible  abortion  is  straightforward 
and  driven  by  schedulability  analysis.  The  general 
avionics  example  adapted  from  [4]  illustrates  a  class 
of  system  for  which  selective  abort  may  increase  the 
number  of  tasks  whose  worst-case  schedulability  can 
be  guaranteed. 

Table 1: Task set characteristics adapted from the generic avionics example in [4].

Task                    Period  Exec.  Blocking*  Read set     Write set       Abort?
                        (ms)    (ms)   (ms)
Timer_Interrupt         1       0.051  0          -            -
Weapon_Release          200     3      9**        -            DB
Radar_Tracking_Filter   25      2      9          DB,N         D,DB,T          Abort
RWR_Contact_Mgmt        25      5      9          DB,N,K,W     D,DB,T          Abort
Poll_Bus_Device         40      1      9          all          -
Weapon_Aim              50      3      9          N,T          D,DB
Radar_Target_Update     50      5      9          DB,N,K       D,DB,T          Abort
Nav_Update              59      8      9          DB,K,R       D,DB,T,R,W,RW   Abort
Display_Graphic         80      9      5          all          DB              Abort
Display_Hook_Update     80      2      5          DB           -               Abort
Tracking_Target_Upd     100     5      3          DB,N,K,R,RW  D,W             Abort
Weapon_Protocol         200     1      3          K            DB
Nav_Steering_Cmds       200     3      3          D            D
Display_Stores_Update   200     1      3          W            DB
Display_Keyset          200     1      3          DB           all
Display_Stat_Update     200     3      1          all          DB              Abort
BET_E_Status_Update     1000    1      1          -            D
Nav_Status              1000    1      0          DB           D

*Worst-case blocking based on treating whole tasks as transactions, as in [2] but in contrast to [4].

**Note that blocking for Weapon_Release as calculated in [4] can be no more than 1.745 ms, despite conflict with
Display_Graphic which has computation time 9 ms and reads and writes DB. This difficulty is avoided in [2] by removing the
jitter requirement for Weapon_Release.

Table 2: Schedulability calculations for the mixed locking/abort protocol.

Task                    Period  Exec.   Blocking  Abort
                        (ms)    (ms)    (ms)
Timer_Interrupt         1       0.051   0         0
Weapon_Release          200     3.0505  1.0545    0
Radar_Tracking_Filter   25      2.0905  3.052     0.253
RWR_Contact_Mgmt        25      5.0935  3.052     5.578
Poll_Bus_Device         40      1.0545  3.052     0
Weapon_Aim              50      3.052   3.051     0

Schedulability Test

Timer_Interrupt:        0.051 < 1

Weapon_Release:         ⌈5/1⌉ × 0.051 + 3.0505 + 1.0545 ≈ 4.36 < 5

Radar_Tracking_Filter:  ⌈25/1⌉ × 0.051 + ⌈25/200⌉ × 3.0505 +
                        ⌈25/25⌉ × 2.3435 + 3.052 ≈ 9.721 < 25

RWR_Contact_Mgmt:       ⌈25/1⌉ × 0.051 + ⌈25/200⌉ × 3.0505 +
                        ⌈25/25⌉ × 2.3435 + ⌈25/25⌉ × 10.6715 +
                        3.052 ≈ 20.3925 < 25

Poll_Bus_Device:        ⌈40/1⌉ × 0.051 + ⌈40/200⌉ × 3.0505 +
                        ⌈40/25⌉ × (2.3435 + 10.6715) +
                        ⌈40/40⌉ × 1.0545 + 3.052 ≈ 35.227 < 40

Weapon_Aim:             ⌈50/1⌉ × 0.051 + ⌈50/200⌉ × 3.0505 +
                        ⌈50/25⌉ × (2.3435 + 10.6715) +
                        ⌈50/40⌉ × 1.0545 + ⌈50/50⌉ × 3.052 + …

Abstract


Two widely-studied approaches for structuring fault-tolerant
services are the state-machine and the primary-backup
replication schemes. For a large class of soft
and hard real-time applications, the degree of consistency
among servers can be exploited to design replication protocols
with predictable timing behavior. This is particularly
useful in applications, such as automated process control,
in which one can trade off the quality or precision for timely
availability of data.

This paper presents the architecture and prototype implementation
of a primary-backup replication service that
employs window consistency semantics between the primary
data repository and the backups. A client registers
a data object with the service by declaring the consistency
requirements for the data, in terms of a time window. The
primary ensures that each backup site maintains a version
of the object that was valid on the primary within the preceding
time window by scheduling update messages to the
backups.

Decoupling the transmission of updates to the backups
from the processing of client requests permits the primary
to handle a higher rate of operations and provide more
timely service to clients. The non-blocking semantics free
the client from waiting for updates to the backups to complete.
Furthermore, real-time scheduling of update messages
can guarantee controlled inconsistency between the
primary and backup repositories.


•The work reported in this paper was supported in part by
the National Science Foundation under Grant MIP-9203895.
Any opinions, findings, and conclusions or recommendations
expressed in this paper are those of the authors and do not necessarily
reflect the view of the NSF. Also supported by a grant
from the Rackham School of Graduate Studies at the University
of Michigan.


1  Introduction 

A  common  approach  to  building  fault-tolerant  dis¬ 
tributed  systems  is  to  replicate  servers  that  fail  in¬ 
dependently.  The  objective  is  to  give  the  clients  the 
illusion  of  a  service  that  is  provided  by  a  single  server. 
Two  widely-studied  approaches  for  structuring  fault- 
tolerant  services  are  the  primary-backup  and  the  state- 
machine  replication  schemes.  In  both  approaches,  a 
modify  request  from  a  client  results  in  the  execution  of 
an  agreement  protocol  that  ensures  consistency  among 
the  replicated  servers.  For  example,  in  the  traditional 
primary-backup  model,  each  client  modify  request  to 
the  primary  repository  requires  an  update  transmis¬ 
sion  to  the  backup.  This  approach  artificially  ties 
the  rate  of  client  modify  operations  to  the  rate  of  up¬ 
dates  to  the  backups,  limiting  both  response  time  pre¬ 
dictability  and  total  system  throughput  while  ensuring 
consistent  data  after  fail-over.  For  a  large  class  of  soft 
and  hard  real-time  applications,  this  restriction  can  be 
more  detrimental  than  having  a  slightly  stale  copy  of 
the  data  on  the  backups. 

This paper presents the architecture of a new
primary-backup scheme, referred to as the window-consistent
replication service [1], which exploits the
ability of many real-time applications to tolerate controlled
time-inconsistency of the repository to provide
timely availability of this data. The notion of
window consistency relaxes atomic or causal consistency
among replicas to obtain less expensive replication
protocols. In particular, the proposed replication
scheme exploits temporal constraints on objects to
maintain a less current but acceptable version of the
primary data on the backup. A client registers a data
object with the service by declaring the consistency
requirements for the data, in terms of a time window.
The primary ensures that each backup site maintains
a version of the object that was valid on the primary
within the preceding time window by scheduling selective
update messages to the backups. The time
window essentially establishes a bounded distance between
the primary and the backup states for each object.
The following example illustrates the motivation
for the proposed approach.

Example - Highly-Available Process Control
System: Consider a primary-backup system for automated
manufacturing and process control applications,
as shown in Figure 1. The primary and the
backup nodes share external devices such as sensors.
The primary runs in a tight loop sampling sensors,
calculating new values, and sending signals to external
I/O under its control. The primary also maintains
an in-memory data repository which is updated frequently
during each iteration of the tight control-loop.
One of the requirements on the system is to be able
to switch to the backup within a few hundred milliseconds
in case of primary failure.

The  in-memory  data  repository  must  be  replicated 
on  the  backup  to  meet  the  strict  timing  constraint  on 
the  switch-over.  Since  there  can  be  hundreds  of  up¬ 
dates  to  the  data  repository  during  each  iteration  of 
the  control  loop,  it  is  impractical  (and  perhaps  impos¬ 
sible)  to  update  the  backup  synchronously  each  time 
the primary copy is changed. An alternative solution
is to exploit the data semantics in a process control
system by allowing the backup to maintain a less current
but acceptable copy of the data that resides
on the primary. If the data on the backup does not
fall too far behind the version on the primary, the
backup can recover from a primary failure. For example,
updates can be sent in batches or selectively
to the backup. If the primary fails, the backup can
take over even if the last few updates are lost. The
objective is to keep the backup data recent enough
that the backup can reconstruct a consistent system state
by extrapolating from previous values and by reading
new sensor values. One must therefore ensure that the
distance between the primary and the backup copies
is bounded within a predefined time. In fact, different
objects may have distinct tolerances in how far
the  backup  can  lag  behind  before  the  object  state  be¬ 
comes  stale.  The  challenge  is  to  bound  the  distance 
between  the  primary  and  the  backup  such  that  consis¬ 
tency  is  not  compromised  while  minimizing  the  over¬ 
head  in  exchanging  messages  between  the  primary  and 
its  backup.  Under  transient  overload  conditions,  the 
system  can  gracefully  degrade  by  allowing  the  backup 
to increase its distance from the primary. □

Figure 1: Primary-backup process control system

A  replication  scheme  based  on  window  consistency 
allows  computations  that  may  otherwise  be  disallowed 
by  existing  active  or  passive  protocols  that  ensure 
atomic  updates  to  a  collection  of  replicas.  Enforcing 
a  weaker  correctness  criterion  has  been  studied  ex¬ 
tensively  for  different  purposes  and  application  areas. 
In  particular,  a  number  of  researchers  have  observed 
in  the  past  that  the  notion  of  serializability  is  too 
strict  a  correctness  criterion  for  real-time  databases. 
Hence,  several  alternatives  have  been  proposed  that 
eliminate or relax serializability as a correctness
criterion for managing consistency in real-time transactions.
Among these are ε-serializability [2], similarity [3],
temporal and external consistency [4], and triggered
real-time databases [5]. The above correctness criteria
allow more concurrency by supporting a limited
amount  of  inconsistency  in  how  a  transaction  views 
the  database  state.  The  idea  of  imprecise  computa¬ 
tion  is  an  interesting  related  approach  that  sacrifices 
accuracy  for  timeliness  in  real-time  computations  [6]. 
Exploiting  weak  consistency  to  obtain  better  perfor¬ 
mance  has  also  been  proposed  in  other  non-real-time 
applications.  For  instance,  in  [7],  the  notion  of  quasi¬ 
copy  is  introduced  which  allows  a  weaker  type  of  con¬ 
sistency  between  the  central  data  and  its  cached  copies 
at  remote  sites.  The  objective  is  to  allow  a  cached  copy 
to  deviate  from  the  central  copy  in  a  controlled  way 
so that a scheduler has more flexibility in propagating
updates.  In  the  same  spirit,  the  notion  of  window  con¬ 
sistency  provides  controlled  inconsistency  among  the 
replicas  in  a  real-time  application  to  support  fault- 
tolerance. 


2  Window  Consistency 

Figure 2: Window-consistent primary-backup architecture

The window-consistent replication service consists of
a primary and one or more backups, with the data
on the primary shadowed at each backup. These sites
store  collections  of  objects  which  change  over  time,  in 
response  to  client  interaction  with  the  primary.  The 
primary  handles  client  requests  and  ensures  that  each 
backup  repository  maintains  a  sufficiently  recent  ver¬ 
sion  of  the  objects.  In  the  absence  of  any  failures, 
the  primary  satisfies  all  client  requests  and  supplies  a 
data-consistent  repository.  If  the  primary  crashes,  a 
window-consistent  backup  performs  a  fail-over  to  be¬ 
come  the  new  primary  by  notifying  the  clients  and 
other  replicas  of  the  new  configuration.  Service  avail¬ 
ability  hinges  on  the  existence  of  a  window-consistent 
backup  to  replace  a  failed  primary. 

The  service  exports  two  sets  of  operations  to  the 
clients, namely, control operations and data operations,
as shown in Figure 2. Control operations, such
as  object  creation  and  deletion,  are  handled  by  the 
primary,  and  require  complete  agreement  within  the 
replication  service.  Client  query  and  modify  oper¬ 
ations  on  the  data  are  handled  locally  at  the  pri¬ 
mary,  without  triggering  interaction  with  the  back¬ 
ups.  The  primary  concurrently  handles  client  requests 
and  schedules  selective  updates  to  the  backup  repos¬ 
itories  to  guarantee  the  window  consistency  require¬ 
ments  of  the  data  objects.  The  primary  object  man¬ 
ager  (OM)  satisfies  client  data  requests,  while  sending 
update  messages  to  the  backup  at  the  behest  of  the 
primary  update  scheduler  (US). 

Whenever  the  primary  P  modifies  an  object,  P 
timestamps  the  new  version;  these  timestamps  iden¬ 
tify  successive  versions  of  the  object  on  the  primary. 
The  primary  schedules  transmissions  to  the  backups  to 
ensure that each backup has a sufficiently recent version
of each object. At time $t$ the primary has a copy
of object $O_i$ with timestamp $\tau_i^P(t)$, while a backup B
stores a possibly older version $\tau_i^B(t)$. While B may
have an older version of $O_i$ than P, the copy on B must
be “recent enough.” The service must ensure that B
always believes in data that was valid on P within the
last $\delta_i$ time units. The value of $\delta_i$ is set during object
creation, based on the time-consistency requirements
of the application; P rejects the registration request
if it cannot satisfy the object’s window consistency
requirements.

Figure 3: Window consistency semantics

B has a window-consistent copy of object $O_i$ at time
$t$ if and only if $\tau_i^P(t - \delta_i) \le \tau_i^B(t) \le \tau_i^P(t)$. For example,
in Figure 3, P performs several operations on
$O_i$, on behalf of client requests, but selectively transmits
update messages to B. At time $t$ the primary
has the most recent version of the object, which was
timestamped at time $t_2$; the backup has a version first
recorded on the primary at time $t_1$. Thus, $\tau_i^P(t) = t_2$,
$\tau_i^B(t) = t_1$, and $\tau_i^P(t - \delta_i) = t_0$. Since $t_0 \le t_1 \le t_2$, B
has a window-consistent version of $O_i$ at time $t$. The
primary US employs real-time scheduling algorithms
to coordinate the selective updates for the various objects.
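
The window-consistency test can be illustrated with a minimal sketch. It assumes that the primary keeps a sorted list of the times at which it modified an object, and that a version is identified by its timestamp; the helper names `version_at` and `is_window_consistent` are hypothetical and do not come from the prototype.

```python
from bisect import bisect_right


def version_at(timestamps, t):
    """Timestamp of the newest primary version created at or before time t.

    `timestamps` is the sorted list of times at which the primary modified
    the object. Returns None if no version existed yet at time t.
    """
    k = bisect_right(timestamps, t)
    return timestamps[k - 1] if k else None


def is_window_consistent(primary_versions, backup_version, t, delta):
    """Check tau_P(t - delta) <= tau_B(t) <= tau_P(t) for one object."""
    oldest_allowed = version_at(primary_versions, t - delta)
    newest = version_at(primary_versions, t)
    if newest is None or backup_version > newest:
        return False
    # If the object did not exist at t - delta, any existing version is recent enough.
    return oldest_allowed is None or backup_version >= oldest_allowed


# Versions created at t0 = 1, t1 = 3, t2 = 6; the backup holds the version from t1.
print(is_window_consistent([1.0, 3.0, 6.0], 3.0, t=7.0, delta=5.0))
```

With a window of 5 time units the backup copy from time 3 is acceptable at time 7; shrinking the window to 3 would make a backup that still holds the version from time 1 window-inconsistent.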

The primary P needs an accurate measure of the
window consistency of the backup objects to determine
the operating mode of the backup sites. Although
P cannot know the exact value of $\tau_i^B$ at
all times, the primary can estimate the state of the
backup. In particular, P does know the time $t_i^{\mathrm{xmit}}$
when it last transmitted an update for $O_i$. The primary
sent version $\tau_i^{\mathrm{xmit}} = \tau_i^P(t_i^{\mathrm{xmit}})$ of the object to
the backup, so P knows $\tau_i^B \le \tau_i^{\mathrm{xmit}}$. However, if P
has not received an acknowledgement for this update,
P cannot guarantee that B has this copy of $O_i$ yet. If
the backups tell P what object versions they have received,
then the primary knows the latest version $\tau_i^{\mathrm{ack}}$
that P has seen acknowledged by B. The primary
then knows that $\tau_i^{\mathrm{ack}} \le \tau_i^B$. Using $\tau_i^{\mathrm{xmit}}$ and $\tau_i^{\mathrm{ack}}$, P
can provide both optimistic and pessimistic measures
of the window consistency of $O_i$ on B.
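
The text does not spell out how the two measures are computed. One plausible reading, sketched below purely as an assumption, is that the pessimistic measure relies only on the newest acknowledged version, while the optimistic measure assumes that the newest transmitted version has arrived. The function name `primary_view` and its parameters are hypothetical.

```python
def primary_view(tau_ack, tau_xmit, tau_p_window_start):
    """Classify a backup copy from the primary's point of view.

    tau_ack:  newest version the backup has acknowledged (tau_ack <= tau_B)
    tau_xmit: newest version the primary has transmitted (tau_B <= tau_xmit)
    tau_p_window_start: tau_P(t - delta), the oldest acceptable version

    Returns (pessimistic, optimistic) booleans. This interpretation is an
    assumption; the paper only states that both measures can be derived.
    """
    pessimistic = tau_ack >= tau_p_window_start   # holds even if nothing newer arrived
    optimistic = tau_xmit >= tau_p_window_start   # holds if the last update arrived
    return pessimistic, optimistic


# The backup acknowledged version 1, version 3 is in flight, the window starts at version 2.
print(primary_view(1.0, 3.0, 2.0))
```

Under this reading the two measures disagree exactly when an update that would restore window consistency has been sent but not yet acknowledged.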

A backup also needs some measure of its own window
consistency to determine whether it is a suitable
substitute for a failed primary. Since a window-inconsistent
backup B cannot supplant a crashed primary,
B cannot tolerate long periods without hearing
from P. The backups must balance the likelihood of
false failure detection with the possibility of having no
window-consistent backup to replace the crashed primary.
Although B may be unaware of recent client
interaction with P for each object, B does know $\tau_i^B$
and the time $t_i^{\mathrm{xmit}}$ when P transmitted this update.
If less than $\delta_i$ time units have elapsed since P sent this
update, then B has data that P believed within the
last $\delta_i$ time units.
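
The backup-side test can be sketched as follows. The sketch assumes that each update carries the primary's transmission time, that the backup requires every object to pass the test before taking over, and that an optional `clock_skew` margin covers imperfectly synchronized clocks; all three are illustrative assumptions and the function names are hypothetical.

```python
def object_is_fresh(now, t_xmit, delta, clock_skew=0.0):
    """True if less than delta time units have elapsed since P sent the update."""
    return (now - t_xmit) + clock_skew < delta


def can_replace_primary(now, objects, clock_skew=0.0):
    """objects maps an object name to (t_xmit, delta) for its latest update."""
    return all(object_is_fresh(now, t_xmit, delta, clock_skew)
               for t_xmit, delta in objects.values())


latest = {"valve": (9.0, 2.0), "pressure": (7.5, 3.0)}
print(can_replace_primary(10.0, latest))
print(can_replace_primary(11.0, latest))
```

At time 10 both objects are within their windows; at time 11 the update for the first object is exactly as old as its window, so the strict comparison from the text rejects it.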

3  Application  of  Real-Time  Schedul¬ 
ing 

The  window-consistent  replication  model  satisfies  a 
high  rate  of  client  query  and  modify  operations  by 
relaxing  the  consistency  constraints  between  the  pri¬ 
mary  and  the  backups.  Under  the  limitations  of  finite 
processing  time  and  network  bandwidth,  however,  the 
primary  must  schedule  the  selective  updates  to  the 
backup  sites.  By  casting  the  transmissions  of  updates 
as  tasks,  the  primary  US  can  draw  upon  real-time  task 
scheduling  algorithms.  While  several  task  models  can 
accommodate  window-consistent  scheduling,  we  ini¬ 
tially  consider  the  periodic  task  model  [8,9]. 

With the periodic model, the primary coordinates
transmissions to the backups by scheduling an update
task with period $p_i$ and service time $e_i$ for each object
$O_i$*. The end of a period serves as both the deadline
for one invocation of the task and the arrival time
for  the  subsequent  invocation.  The  scheduler  always 
runs  the  ready  task  with  the  highest  priority,  preempt¬ 
ing  execution  if  a  higher-priority  task  arrives.  Rate- 
monotonic  scheduling  statically  assigns  higher  priority 
to  tasks  with  shorter  periods  [8,9],  while  earliest-due- 
date  scheduling  favors  tasks  with  earlier  deadlines  [8]. 

* The size of $O_i$ determines the time $e_i$ required for each
update transmission. In order to accommodate preemptive
scheduling and objects of various sizes, the primary can send
an update message as one or more fixed-length packets.

Figure 4: Compressing a periodic schedule. (a) Periodic schedule;
(b) compressed periodic schedule ($p_1 = 5$, $e_1 = 2$, $p_2 = 3$, $e_2 = 1$)

Given a schedulable set of tasks, the primary US
ensures that $O_i$ is sent to the backup once per period
$p_i$, resulting in a maximum time of $2p_i$ between successive
transmissions of $O_i$. The replication service must
consider object consistency requirements and transmission
delays in determining $p_i$ for each object. In
the absence of a link failure, we assume a bound $d$ on
the end-to-end latency between the primary and the
backups. If a client operation modifies $O_i$, the primary
must send an update for the object within the
next $\delta_i - d$ time units; otherwise, the backups may
not receive a sufficiently recent version of $O_i$ before
the time-window $\delta_i$ elapses. For window consistency
this permits a maximum period $p_i = (\delta_i - d)/2$.
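
As a rough illustration of how an object could be admitted, the sketch below derives the maximum period from the window and the latency bound, and then applies the classical sufficient utilization test for rate-monotonic scheduling, $U \le n(2^{1/n} - 1)$. The choice of that particular test is an assumption made for illustration (the text does not say which schedulability test the primary uses), and the function names are hypothetical.

```python
def max_period(delta, d):
    """Largest update period that preserves window consistency: (delta - d) / 2."""
    if delta <= d:
        raise ValueError("the window must exceed the latency bound d")
    return (delta - d) / 2.0


def utilization(tasks):
    """tasks is a list of (period, service_time) pairs."""
    return sum(e / p for p, e in tasks)


def rm_admissible(tasks):
    """Sufficient (not necessary) rate-monotonic test: U <= n * (2**(1/n) - 1)."""
    n = len(tasks)
    if n == 0:
        return True
    return utilization(tasks) <= n * (2.0 ** (1.0 / n) - 1.0)


# Windows of 11 and 7 time units with a latency bound of 1 give periods 5 and 3.
tasks = [(max_period(11, 1), 2.0), (max_period(7, 1), 1.0)]
print(tasks, rm_admissible(tasks))
```

A task set that fails this bound may still be schedulable, so a real admission test could fall back to an exact analysis before rejecting a registration request.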
Compressing  the  periodic  schedule:  While  the 
periodic  model  can  guarantee  sufficient  updates  for 
each  object,  the  schedule  updates  Oi  only  once  per 
period  pi ,  even  if  computation  and  network  resources 
permit  more  frequent  transmissions.  This  restriction 
arises  because  the  periodic  model  assumes  that  a  task 
becomes  ready  to  run  only  at  period  boundaries.  How¬ 
ever,  the  primary  can  transmit  the  current  version  of 
an  object  at  any  time.  The  scheduler  can  capitalize  on 
this  task  readiness  to  improve  both  resource  utilization 
and the window consistency on the backups by compressing
the periodic schedule. Consider two objects
$O_1$ and $O_2$ as depicted in Figure 4. The scheduler
must  send  an  update  requiring  1  unit  of  processing 
time  once  every  3  time  units  (unshaded  box)  and  an 
update  requiring  2  units  of  processing  time  once  every 
5  time  units  (shaded  box).  For  this  example,  both  the 
rate-monotonic  and  earliest-due-date  algorithms  gen¬ 
erate  the  schedule  shown  in  Figure  4(a).  While  each 
update  is  sent  as  required  in  the  major  cycle  of  length 
15,  the  schedule  has  4  units  of  slack  time.  The  peri¬ 
odic  schedule  can  provide  the  order  of  task  executions 
without  restricting  the  time  the  tasks  become  active. 
If  no  tasks  are  ready  to  run,  the  scheduler  can  advance 
to  the  earliest  pending  task  and  activate  that  task  by 
advancing  the  logical  time  to  the  start  of  the  next  pe¬ 
riod  for  that  object.  With  the  compressed  schedule 
the primary still transmits an update for each $O_i$ at
least  once  per  period  pi  but  can  send  more  frequent 
updates  when  time  allows.  As  shown  in  Figure  4(b), 
compressing  the  slack  time  allows  the  schedule  to  start 
over  at  time  11.  In  the  worst  case,  the  compressed 
schedule  degrades  to  the  standard  periodic  schedule 
with the associated guarantees.

Integrating a new backup: To minimize the time the service operates
without  a  window-consistent  backup,  the  primary  P 
needs  an  efficient  mechanism  to  integrate  a  new  or 
invalid  backup.  P  must  send  the  new  backup  B  a 
copy  of  each  object  and  then  transition  to  the  peri¬ 
odic  schedule  to  sustain  B.  B  must  receive  a  copy 
of  Oi  in  the  “period”  pi  before  the  periodic  schedule 
begins;  this  ensures  that  B  can  afford  to  wait  until 
the  next  pi  interval  to  start  receiving  periodic  update 
messages  for  Oi .  P  guarantees  a  smooth  transition  to 
the  periodic  schedule  by  sending  the  objects  to  B  in 
reverse  period  order,  such  that  the  objects  with  larger 
periods  are  sent  before  those  with  smaller  periods.  For 
object  Oi,  this  ensures  that  only  objects  with  smaller 
or  equivalent  periods  can  follow  Oi  in  the  integration 
schedule;  these  same  objects  can  precede  Oi  in  the 
periodic schedule. This guarantees that the integration
schedule transmits $O_i$ no more than $p_i$ time units
before  the  start  of  the  periodic  schedule,  ensuring  a 
consistent  transition.  The  reverse-period  integration 
schedule  transmits  a  single  copy  of  each  object,  min¬ 
imizing  the  time  required  to  establish  window  consis¬ 
tency  on  a  new  backup. 
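
The two scheduling ideas in this section, compressing the periodic schedule and integrating a new backup in reverse period order, can be sketched as follows. The sketch assumes integer periods and service times, a unit-step rate-monotonic simulation, and a single logical clock that skips idle slots; this is one reading of the compression rule, not the prototype's code, and the names `simulate` and `integration_order` are hypothetical.

```python
from functools import reduce
from math import gcd


def simulate(tasks, compress):
    """Rate-monotonic schedule over one major cycle, in unit time steps.

    tasks: list of (period, service_time) pairs with integer values,
    assumed schedulable. Returns (real_time_used, slack_units).
    """
    major = reduce(lambda a, b: a * b // gcd(a, b), (p for p, _ in tasks))
    remaining = [0] * len(tasks)      # unfinished work of each current invocation
    logical = real = slack = 0
    while logical < major:
        for k, (p, e) in enumerate(tasks):
            if logical % p == 0:      # a new invocation arrives at a period boundary
                remaining[k] = e
        ready = [k for k, r in enumerate(remaining) if r > 0]
        if ready:
            k = min(ready, key=lambda j: tasks[j][0])   # shortest period first
            remaining[k] -= 1
            real += 1
        elif not compress:
            slack += 1                # idle slot consumes real time
            real += 1
        # With compression an idle slot advances only the logical clock.
        logical += 1
    return real, slack


def integration_order(periods):
    """Object names ordered so that objects with larger periods are sent first."""
    return sorted(periods, key=periods.get, reverse=True)


tasks = [(5, 2), (3, 1)]              # the example of Figure 4
print(simulate(tasks, compress=False), simulate(tasks, compress=True))
print(integration_order({"O1": 5, "O2": 3}))
```

For the Figure 4 task set the uncompressed cycle takes 15 time units with 4 units of slack, while the compressed cycle completes in 11, matching the values quoted in the text.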

4  Prototype  Implementation  and  On¬ 
going  Work 

We  are  developing  a  prototype  implementation  of  the 
window-consistent  replication  service  to  demonstrate 
and  evaluate  the  proposed  service  model.  Each  client 
or  repository  site  is  currently  a  Sun  SPARCstation 
running  Solaris  1.1.  The  sites  communicate  through 
UDP datagrams using the Socket++ library from the
University  of  Virginia,  with  extensions  for  priority- 
based  access  to  the  active  sockets.  The  prototype 
implements rate-monotonic scheduling with compression.
While Solaris 1.1 provides a stable environment
for  code  development  and  testing,  the  platform  does 
not  support  real-time  thread  scheduling,  bounded 
communication  delays,  or  synchronized  clocks.  After 
initial  code  development  and  testing,  we  will  evaluate 
the  service  in  a  real-time  distributed  system  [10]. 

The  prototype  provides  a  general  framework  for 
comparing  the  performance  of  different  scheduling  al¬ 
gorithms  for  coordinating  update  transmissions  to  the 
backups.  We  are  currently  investigating  the  distance- 
constrained  task  model  [11]  which  assigns  priorities 
based  on  separation  restrictions.  In  addition,  we  are 
considering  adaptive  scheduling  algorithms  that  incor¬ 
porate  knowledge  of  recent  client  interaction  with  the 
primary. These alternative models may permit more
optimistic schedulability criteria (with some increased
cost  in  scheduler  complexity),  allowing  the  replication 
service  to  accept  objects  with  more  demanding  win¬ 
dow  consistency  requirements. 

Window  consistency  offers  a  framework  for  design¬ 
ing  replication  protocols  with  predictable  timing  be¬ 
havior.  By  decoupling  communication  within  the  ser¬ 
vice  from  the  handling  of  client  requests,  a  replica¬ 
tion  protocol  can  handle  a  higher  rate  of  query  and 
modify  operations  and  provide  more  timely  response 
to  clients.  Scheduling  the  selective  communication 
within  the  service  provides  bounds  on  the  degree  of 
inconsistency  between  servers.  Although  the  current 
prototype  implements  the  primary-backup  replication 
model,  we  are  exploring  the  application  of  window 
consistency  to  the  state-machine  approach  to  server 
replication. In addition, we are investigating the influence
of controlled time-inconsistency on failure detection
and recovery in replication protocols.

Abstract

In  [4],  Graham  proposes  several  conditions  which 
are  sufficient  to  guarantee  that  a  transaction  system 
will  run  serializably  without  any  extra  effort  having 
to  be  taken.  Systems  satisfying  these  conditions  are 
said  to  achieve  serializability  for  free.  The  conditions 
considered  by  Graham  are  determined  by  a  syntactic 
check  on  the  transaction  programs,  and  are  indepen¬ 
dent  of  the  semantics  of  data.  In  this  paper,  we  use  a 
semantic  approach  and  propose  a  sufficient  condition 
for  achieving  data  synchronization  for  free  which  is 
based  on  the  concept  of  data  similarity  [5].  Real-time 
transactions  satisfying  this  condition  can  be  scheduled 
correctly  by  any  process  scheduling  discipline  that  is 
designed  for  the  independent  processes  model  [8],  e.g., 
RMS,  EDF,  where  no  locking  of  data  is  assumed.  The 
correctness  of  our  approach  is  justified  by  exploiting 
the idea of Δ-serializability.

1  Introduction 

There has been substantial interest in the performance
of transaction systems which have significant
response  time  requirements.  These  requirements  are 
usually  posed  as  deadlines  on  individual  transactions. 
A  scheduling  algorithm  must  attempt  to  meet  dead¬ 
lines  as  well  as  preserve  database  consistency.  Due  to 
priority  inversion  caused  by  data  access,  a  transaction 
system  with  moderate  utilization  factor  is  often  hard 
to  schedule.  In  [5],  we  explored  a  weaker  correctness 
criterion for concurrency control in real-time transactions,
namely, the notion of “similarity”, so that the
schedulability problem is relaxed by the flexibility in
scheduling read/write events introduced by the notion
of  similarity. 

** Supported in part by a research grant from the Office of
Naval Research under ONR contract number N00014-89-J-1472

* Notice that PCP and SRP are originally designed for single
processor systems. Hence comparing them with SSP may not


In  [6],  we  proposed  a  class  of  real-time,  data-access 
protocols  called  SSP  (Similarity  Stack  Protocol)  which 
is  based  on  the  notion  of  similarity.  Transactions  are 
allowed  to  run  concurrently  if  their  conflicting  events 
are  strongly  similar.  We  gave  a  schedulability  bound 
for  SSP  and  also  reported  some  encouraging  simula¬ 
tion  results.  Although  SSP  was  shown  to  compare  fa¬ 
vorably  with  Priority  Ceiling  Protocol  (PCP)  [11]  and 
Stack Resource Policy (SRP) [2], especially on multiprocessor
systems*, a transaction system with a moderate
workload is still hard to schedule. In this paper,
we further explore the notion of similarity and propose
a sufficient condition, so that a real-time transaction
satisfying this condition can be scheduled correctly
by an independent process scheduling algorithm
such as RMS or EDF [8]. This means that the usually
high utilization factor that can be achieved by these
scheduling  algorithms  is  also  attainable  for  transac¬ 
tions  satisfying  our  condition. 

The  rest  of  the  paper  is  organized  as  follows.  Sec¬ 
tion  2  summarizes  the  similarity  concept  and  the  cor¬ 
rectness  criterion  in  [5].  Section  3  describes  a  sufficient 
condition  with  which  transactions  can  be  scheduled 
independently.  Section  4  further  extends  the  results 
presented  in  section  3.  Section  5  is  the  conclusion. 

2  Data  Similarity 

2.1  Database  Model 

The  state  of  a  real-time  system  is  represented  by  the 
values  of  a  collection  of  data  objects.  Each  data  object 
takes  its  value  from  its  domain.  Events  are  primitive 
data operations (atomic read or write) which may occur
many times in a computation. Each instance of
an  event  is  associated  with  a  time  stamp  whose  value 
is  the  (wall-clock)  time  at  which  the  event  instance 
occurs.  A  transaction  is  a  partial  order  of  events.  An 
instance of a transaction is scheduled for every request
for the transaction. To distinguish between a transaction
and an instance of it, we shall use the notation
$\tau_{i,j}$ to denote the $j$th instance of transaction $\tau_i$. The
view of a transaction (instance) is a vector of data object
values such that the $i$th component is the value
read by the $i$th read event of the transaction instance
[10].

A  schedule  over  a  set  of  transactions  is  a  partial  or¬ 
der  of  event  instances  issued  by  instances  of  the  trans¬ 
action  set.  Each  event  instance  in  a  schedule  is  issued 
by  one  transaction  instance.  The  ordering  of  event 
instances  in  a  schedule  must  be  consistent  with  the 
event  ordering  as  specified  by  the  transaction  set.  In 
a  real-time  computation,  the  partial  ordering  of  event 
instances  in  a  schedule  is  induced  by  the  time  stamps 
of  event  instances  at  different  sites.  A  serial  schedule 
is  a  sequence  of  transaction  instances  (i.e.,  a  schedule 
in which the transaction instances are totally ordered).

2.2  Similarity 

The  value  of  a  data  object  that  models  an  entity  in 
the  real  world  cannot  in  general  be  updated  continu¬ 
ously  to  perfectly  track  the  dynamics  of  the  real-world 
entity.  The  time  needed  to  perform  an  update  alone 
necessarily  introduces  a  time  delay  which  means  that 
the  value  of  a  data  object  cannot  be  instantaneously 
the  same  as  the  corresponding  real-world  entity.  For¬ 
tunately,  it  is  often  unnecessary  for  data  values  to  be 
perfectly  up-to-date  or  precise  to  be  useful.  In  partic¬ 
ular,  data  values  of  a  data  object  that  are  slightly 
different  are  often  interchangeable  as  read  data  for 
transactions.  This  observation  underlies  the  concept 
of  similarity  among  data  values. 

Similarity  is  a  binary  relation  on  the  domain  of  a 
data  object.  Every  similarity  relation  is  reflexive  and 
symmetric,  but  not  necessarily  transitive.  Different 
transactions  can  have  different  similarity  relations  on 
the  same  data  object  domain.  Two  views  of  a  trans¬ 
action  are  similar  iff  every  read  event  in  both  views 
uses  similar  values  with  respect  to  the  transaction. 
We  say  that  two  values  of  a  data  object  are  similar  if 
all  transactions  which  may  read  them  consider  them 
as  similar.  In  a  schedule,  we  say  that  two  event  in¬ 
stances  are  similar  if  they  are  of  the  same  type  and 
access  similar  values  of  the  same  data  object.  We  say 
that  two  database  states  are  similar  if  the  correspond¬ 
ing  values  of  every  data  object  in  the  two  states  are 
similar. 
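
A small sketch makes the properties of a similarity relation concrete. It assumes a hypothetical threshold relation in which two readings are similar when they differ by at most `eps`; the paper leaves the choice of relation to the application, so this is only one possible instance.

```python
def make_similarity(eps):
    """Return a relation that treats two readings as similar if they differ by at most eps."""
    def similar(a, b):
        return abs(a - b) <= eps
    return similar


def views_similar(view_a, view_b, relations):
    """Two views are similar iff every pair of corresponding read values is similar."""
    return len(view_a) == len(view_b) and all(
        rel(a, b) for a, b, rel in zip(view_a, view_b, relations))


similar = make_similarity(1)
print(similar(10, 10))                    # reflexive
print(similar(10, 11), similar(11, 10))   # symmetric
print(similar(11, 12), similar(10, 12))   # but not transitive
print(views_similar([10, 50], [11, 50], [similar, similar]))
```

The readings 10 and 11 are similar, as are 11 and 12, yet 10 and 12 are not, which is why a similarity relation need not be an equivalence relation.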

A  minimal  restriction  on  the  similarity  relation  that 
makes  it  interesting  for  concurrency  control  is  the  re¬ 
quirement  that  it  is  preserved  by  every  transaction, 
i.e., if a transaction T maps database state s to state
t and state s' to t', then t and t' are similar if s and s'
are  similar.  We  say  that  a  similarity  relation  is  regular 
if  it  is  preserved  by  all  transactions.  We  are  interested 
in  regular  similarity  relations  only. 

2.3  Strong  Similarity 

The  definition  of  regular  similarity  only  requires  a 
similarity  relation  to  be  preserved  by  every  transac¬ 
tion,  so  that  the  input  value  of  a  transaction  can  be 
swapped  with  another  in  a  schedule  if  the  two  values 
are  related  by  a  regular  similarity  relation.  Unless  a 
similarity  relation  is  also  transitive,  in  which  case  it 
is  an  equivalence  relation,  it  is  in  general  incorrect  to 
swap  events  an  arbitrary  number  of  times  in  a  sched¬ 
ule. 

The  notion  of  strong  similarity  was  introduced  in  [5] 
which  has  the  property  that  swapping  similar  events 
in  a  schedule  will  always  preserve  similarity  in  the 
output.  This  notion  is  motivated  by  the  observation 
that  the  state  information  of  many  real-time  systems 
is  “volatile”,  i.e.,  they  are  designed  in  such  a  way  that 
system  state  is  determined  completely  by  the  history 
of the recent past, e.g., the velocity and acceleration
of  a  vehicle  are  computed  from  the  last  several  values 
of  the  vehicle’s  position  from  the  position  sensor.  Un¬ 
less  events  in  a  schedule  may  be  swapped  in  such  a 
way  that  a  transaction  reads  a  value  that  is  derived 
from  the  composition  of  a  long  chain  of  transactions 
that extends far into the past, a suitable similarity
relation  may  be  chosen  such  that  output  similarity  is 
preserved  by  limiting  the  “distance”  between  inputs 
that  may  be  read  by  a  transaction  before  and  after 
swapping  similar  events  in  a  schedule. 

For  the  purpose  of  this  paper,  it  suffices  to  note  that 
if  two  events  in  a  schedule  are  strongly  similar  (i.e., 
they are either both writes or both reads, and the two
data  values  involved  are  strongly  similar),  then  they 
can  always  be  swapped  in  a  schedule  without  violating 
data  consistency  requirements. 

3  A  Sufficient  Condition 

Similarity  is  an  inherently  application-dependent 
concept,  and  we  expect  the  application  engineer  to  de¬ 
fine  it  for  specific  applications.  In  many  real-time  ap¬ 
plications,  it  is  often  acceptable  to  use  an  older  value  of 
a  sensor  as  input  to  a  calculation,  instead  of  waiting  for 
a  more  up-to-date  value.  This  is  possible  because  the 
physics  of  the  application  may  be  such  that  changes 
in  sensor  reading  over  a  short  interval  of  time  are  so 
small as to be insignificant to the calculation. This
observation provides us with the needed connection
between  similarity  and  timing  constraints  governing 
data  access. 

Specifically,  we  assume  that  the  application  seman¬ 
tics  allows  us  to  derive  a  similarity  bound  for  each  data 
object  such  that  two  write  events  on  the  data  object 
must  be  strongly  similar  if  their  time-stamps  differ  by 
an  amount  no  greater  than  the  similarity  bound,  i.e., 
all  instances  of  write  events  on  the  same  object  that 
occur  in  any  interval  shorter  than  the  similarity  bound 
can  be  swapped  in  the  (untimed)  schedule  without  vi¬ 
olating  consistency  requirements.  Notice  that  the  ex¬ 
istence  of  a  similarity  bound  does  not  imply  that  the 
similarity  relation  is  transitive,  since  event  swapping 
is  based  on  (wall-clock)  time  values  and  not  on  the 
relative  positions  of  events  in  a  schedule. 

3.1  Basic  Idea 

The  basic  idea  is  that  transactions  should  not  block 
one  another  as  long  as  meeting  timing  constraints 
guarantees  the  strong  similarity  of  their  conflicting 
events.  The  event  conflicts  are  resolved  by  appeal¬ 
ing  to  the  similarity  bound  in  the  following  discussion 
which  refers  to  Figure  1. 


Figure 1: Similarity of conflicting events (interval label: no larger than similarity bound)

Suppose two events $e_1$ and $e_2$ conflict with each
other. Let $e_1$ and $e_2$ be the write events $w_2$ and $w_3$,
respectively. If their write values are similar under
the similarity bound as shown in Figure 1, these two
write events are strongly similar and it does not matter
which write value is read by subsequent read events.
Suppose $e_1$ and $e_2$ are, respectively, the write event
$w_2$ and the read event $r$ in Figure 1. For their relative
ordering to be unimportant, there must exist an earlier
write event whose write value is similar to the write
value of $w_2$ under the similarity bound. If this is the
case, as is shown in Figure 1, then it does not matter
which write value the read event $r$ reads. The same
argument applies to the case where $e_1$ and $e_2$ are a
read event and a write event, respectively.


3.2  A  Sufficient  Condition:  Timing  Con¬ 
straints  vs  Data  Similarity 

Suppose $sb_i$ is a similarity bound for a data object
$x_i$. Any two writes on $x_i$ within an interval shorter
than $sb_i$ are interchangeable because they are strongly
similar. Let $p_i^{\max}$, $p_i^{\mathrm{2nd}}$, and $p_i^{\min}$ be the maximum,
the second largest, and the minimum periods of transactions
updating $x_i$, respectively. If there is only one
transaction updating $x_i$, then $p_i^{\mathrm{2nd}}$ is equal to $p_i^{\min}$
and $p_i^{\max}$. Suppose $p_i^r$ is the maximum period of transactions
reading $x_i$. In the following, we shall derive a
sufficient condition which guarantees the “strong similarity”
of any concurrently executing transaction instances.

For  simplicity  of  discussion,  we  assume  in  this  paper 
that  the  deadline  of  a  transaction  instance  is  equal  to 
the  end  of  its  period.  Extension  of  our  results  to  relax 
this  restriction  is  straightforward. 

Write vs Write Condition: $(p_i^{\max} + p_i^{\mathrm{2nd}}) \le sb_i$

By our definition of strong similarity, two conflicting
write events are interchangeable if they are
strongly similar. In other words, conflicting write
events of any overlapping transaction instances are interchangeable
if these write events are strongly similar.
(We say that two transaction instances overlap if their
executions overlap in time.) If no transaction misses
its deadline, the maximum temporal distance between
any two conflicting write events of overlapping transaction
instances on data object $x_i$ is $(p_i^{\max} + p_i^{\mathrm{2nd}})$. Obviously,
if $(p_i^{\max} + p_i^{\mathrm{2nd}}) \le sb_i$, conflicting write events
of any overlapping transaction instances are strongly
similar and interchangeable.

Notice that the Write vs Write condition for data
object $x_i$ can be ignored if there is only one transaction
updating $x_i$. This is because no two instances of the
same transaction will overlap if the transaction never
misses its deadline. □

Read  vs  Write  Condition:  (p["‘'^-t-2pP""-t-p-')  <  s6, 

Suppose $\tau$ is a transaction with period $p_i^r$ that reads
data object $x_i$. To ensure correctness, conflicting write
events which might be read by an instance of $\tau$ must be
strongly similar (thus interchangeable) so that any instance of
$\tau$ will not block or be blocked by transaction instances
which may update $x_i$.

If no transaction updating $x_i$ misses its deadline, then no
read event $e_r$ can read from a conflicting write event which
occurred more than $2p_i^{\min}$ ago. Let this oldest write
event be called $write^{old}$ of $e_r$. For ease of argument,
we assume without loss of generality that the initial database
state is determined by a fictitious set of write events so that
an oldest write event always exists. On the other hand, a
transaction instance which overlaps with the transaction
instance issuing $e_r$ may issue a conflicting write event
almost $p_i^{\max}$ later than the end of the period of the
transaction instance issuing $e_r$. Call this the latest
conflicting write event of $e_r$. Since $e_r$ itself occurs
within a period of length $p_i^r$, the temporal distance between
these two write events is bounded by
$2p_i^{\min} + p_i^r + p_i^{\max}$. Obviously, this transaction
instance of $\tau$ (which issues $e_r$) should not block or be
blocked by any transaction instance because of a read-write
access conflict on $x_i$, assuming that the maximum temporal
distance between $write^{old}$ and the latest conflicting write
event of $e_r$ is no more than the similarity bound $sb_i$ of
$x_i$. In other words, the read-write access conflict on $x_i$
can be resolved if $(p_i^{\max} + 2p_i^{\min} + p_i^r) < sb_i$.

When there is only one transaction updating $x_i$, then
$p_i^{\max} = p_i^{\min}$. □ (In the last case, further
optimization is possible.)
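
The three terms of the Read vs Write condition can be made explicit
with a small Python sketch. The names `read_write_span` and
`read_write_ok` are hypothetical; the sketch assumes the condition
reads $(p_i^{\max} + 2p_i^{\min} + p_i^r) < sb_i$ and that all values
use the same time unit.

```python
def read_write_span(update_periods, reader_period):
    """Worst-case distance between the oldest write a read may see
    and the latest conflicting write of an overlapping instance."""
    staleness = 2 * min(update_periods)  # oldest readable write
    lateness = max(update_periods)       # write after the reader's period
    return staleness + reader_period + lateness


def read_write_ok(update_periods, reader_period, sb):
    """Check the Read vs Write condition for one data object."""
    if not update_periods:
        return True  # nothing updates the object, so no conflict
    return read_write_span(update_periods, reader_period) < sb


# Updaters with periods 10 and 20, reader with period 15:
# span = 2*10 + 15 + 20 = 55.
print(read_write_span([10, 20], 15))
print(read_write_ok([10, 20], 15, 60))
```

With a single updater of period p, the span reduces to
3p plus the reader's period, since the maximum and minimum update
periods coincide.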

We  claim  that,  if  a  transaction  set  satisfies  the  Read 
vs  Write  and  Write  vs  Write  conditions,  then  these 
transactions can be scheduled independently as if they
do  not  share  data  (with  the  usual  assumption  that 
individual  read  and  write  events  are  atomic).  Formal 
justification  of  this  claim  is  stated  in  Theorem  2  below. 
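
A check of a whole transaction set against both conditions might
look like the following Python sketch. The `Transaction` class, the
function `synchronization_free`, and the input layout are assumptions
made for illustration only. The sketch uses the largest reader
period per object, which suffices because the Read vs Write
inequality is hardest to satisfy for the longest-period reader.

```python
from dataclasses import dataclass, field


@dataclass
class Transaction:
    period: float
    reads: set = field(default_factory=set)   # names of objects read
    writes: set = field(default_factory=set)  # names of objects updated


def synchronization_free(transactions, sb):
    """Return True if every data object in `sb` (object -> similarity
    bound) passes the Write vs Write and Read vs Write checks."""
    for obj, bound in sb.items():
        writers = [t.period for t in transactions if obj in t.writes]
        readers = [t.period for t in transactions if obj in t.reads]
        if not writers:
            continue  # never updated, so no conflicting writes
        p_max, p_min = max(writers), min(writers)
        # Write vs Write is ignored when there is a single updater.
        if len(writers) > 1 and not (p_max + p_min < bound):
            return False
        if readers and not (p_max + 2 * p_min + max(readers) < bound):
            return False
    return True


sensor = Transaction(period=10, writes={"x"})
display = Transaction(period=20, reads={"x"})
# Read vs Write: 10 + 2*10 + 20 = 50, so the bound must exceed 50.
print(synchronization_free([sensor, display], {"x": 51}))
print(synchronization_free([sensor, display], {"x": 50}))
```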

Suppose two schedules $\pi$ and $\pi'$ have the same event set
E, and A and A* are respectively a strong similarity relation
and a regular similarity relation for both $\pi$ and $\pi'$. We
say that $\pi'$ is a derived schedule of $\pi$ if for any read
event that appears in $\pi$ and $\pi'$, the two corresponding
write events in $\pi$ and $\pi'$ read by the read event are
strongly similar in $\pi$, and the last write events which
update the same data object in $\pi$ and $\pi'$ are strongly
similar in $\pi$.

Theorem 1 [5, 7] Suppose two schedules $\pi$ and $\pi'$ have the
same event set E, and A and A* are, respectively, a strong
similarity relation and a regular similarity relation for both
$\pi$ and $\pi'$. If $\pi'$ is a derived schedule of $\pi$, then
$\pi$ and $\pi'$ are view-similar under A*, i.e., $\pi$ and
$\pi'$ transform similar states (under A) into similar states
(under A*).

Notice  that  view-similarity  is  an  extension  of  view 
equivalence  [5,  10].  A  schedule  is  view  A-serializable 
if  it  is  view  similar  to  a  serial  schedule. 

Theorem  2  If  a  transaction  set  satisfies  both  the 
Read  vs  Write  and  Write  vs  Write  conditions,  then 
any  schedule  that  satisfies  all  transaction  deadlines  is 
view  A-serializable. 

Proof. The proof follows directly from Theorem 1 if there exists
a serial schedule $\pi'$ which is a derived schedule of any
schedule $\pi$ that satisfies all transaction deadlines.
According to the definition of "derived schedule", $\pi$ and
$\pi'$ must satisfy the following two requirements: (1) all
write events read by the same read event in $\pi$ and $\pi'$
must be strongly similar in $\pi$, and (2) the last write events
on every data object in $\pi$ and $\pi'$ must be strongly
similar in $\pi$. In the following, we shall prove that there
exists a sequence of event swaps from $\pi$ to some serial
schedule $\pi'$ such that the requirements of a derived schedule
are preserved at every step in the sequence.

Since any conflicting write events of overlapping transaction
instances in $\pi$ are strongly similar (according to the Write
vs Write condition), they can be swapped in any way without
violating the second requirement of a derived schedule.
Likewise, a conflicting read event and a conflicting write event
of two overlapping transaction instances in $\pi$ can be swapped
in any way without violating the first requirement of a derived
schedule, because they are "strongly similar" according to the
Read vs Write condition. In particular, instances of all write
events on the same data object that occur in any interval
shorter than the similarity bound can be swapped in an (untimed)
schedule without violating consistency requirements. Thus,
swapping such write events will not violate the first
requirement of a derived schedule. Therefore, conflicting events
of overlapping transaction instances can be swapped in any
order. Since non-conflicting events can also be swapped in any
order, events of overlapping transaction instances can be
swapped in any order. In other words, overlapping transaction
instances can be serialized in any order. Also, transaction
instances which are not overlapped in $\pi$ are already
serialized. Therefore, $\pi$ can be serialized by swapping
events of overlapping transaction instances in any order. □

3.3  Extensions 

Since different transactions may have different precision
requirements for a data object, the Read vs Write and Write vs
Write conditions can be weakened. Suppose $sb_i'$ is the
similarity bound of a data object $x_i$ with respect to a
transaction $\tau'$. The Read vs Write condition can be weakened
to $(p_i^{\max} + 2p_i^{\min} + p') < sb_i'$ if the period of
$\tau'$ is $p'$. The Write vs Write condition can be weakened
to $(p_i^{\max} + p_i^{\min}) < sb_i'$.
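
The weakened Read vs Write condition can be illustrated by checking
each reader against its own similarity bound. In this Python sketch
the function name `weakened_read_write` and the
`(period, bound)` pair layout are hypothetical; the inequality
assumed is $(p_i^{\max} + 2p_i^{\min} + p') < sb_i'$.

```python
def weakened_read_write(update_periods, readers):
    """Evaluate the weakened Read vs Write condition per reader.

    update_periods: periods of the transactions updating the object.
    readers:        list of (period, bound) pairs, where `bound` is
                    the similarity bound of the object with respect
                    to that particular reader (hypothetical layout).
    Returns one boolean per reader, in input order.
    """
    if not update_periods:
        return [True for _ in readers]
    p_max, p_min = max(update_periods), min(update_periods)
    return [p_max + 2 * p_min + period < bound
            for period, bound in readers]


# Same reader period, different precision requirements:
# 20 + 2*10 + 15 = 55 passes a bound of 60 but not a bound of 50.
print(weakened_read_write([10, 20], [(15, 60), (15, 50)]))
```

A reader that tolerates a coarser view of the data can therefore be
scheduled freely even when a more demanding reader of the same
object cannot.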

Finally, we consider the situation where some transactions
satisfy the Read vs Write and Write vs Write conditions, but
others do not. In this case, the transaction system cannot be
scheduled "fully" independently. A simple variation of the
Similarity Stack Protocol (SSP) [6] can be made to take care of
this situation, as follows.

As in SSP, transactions are partitioned into interactive sets
such that no two transactions in different interactive sets may
share any data object. If all transactions in an interactive set
satisfy the Read vs Write and Write vs Write conditions, the
recency bound of the interactive set can be set to $\infty$ so
that transactions in the interactive set can be scheduled
independently of one another. Here, the recency bound of an
interactive set limits the length of any interval spanned by
overlapping transaction instances in the set. If any transaction
in an interactive set fails any one of the conditions, the
recency bound of the interactive set is calculated as defined in
[6]. The correctness of this approach can be justified by an
argument similar to that of the last section.
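
The partitioning step can be sketched in Python with a union-find
structure. The function names, the `accesses` mapping, and the
`satisfies` flags are hypothetical. The recency bound computation of
[6] is not reproduced here, so sets that need it are marked with
`None`.

```python
import math


def interactive_sets(accesses):
    """Partition transactions so that no two transactions in different
    sets share a data object.

    accesses: dict mapping a transaction name to the set of data
              objects it reads or writes (hypothetical layout).
    """
    parent = {t: t for t in accesses}

    def find(t):
        while parent[t] != t:
            parent[t] = parent[parent[t]]  # path halving
            t = parent[t]
        return t

    owner = {}  # data object -> first transaction seen accessing it
    for t, objects in accesses.items():
        for obj in objects:
            if obj in owner:
                parent[find(t)] = find(owner[obj])  # merge the two sets
            else:
                owner[obj] = t

    groups = {}
    for t in accesses:
        groups.setdefault(find(t), set()).add(t)
    return list(groups.values())


def recency_bounds(accesses, satisfies):
    """Pair each interactive set with math.inf if all of its members
    satisfy both conditions, or None if the bound must be computed
    as defined in [6]."""
    return [(group, math.inf if all(satisfies[t] for t in group) else None)
            for group in interactive_sets(accesses)]


accesses = {"T1": {"x"}, "T2": {"x", "y"}, "T3": {"z"}}
satisfies = {"T1": True, "T2": True, "T3": False}
for group, bound in recency_bounds(accesses, satisfies):
    print(sorted(group), bound)
```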


4  Conclusion  and  Future  Research 

In [4], Graham proposes several conditions which are sufficient
to guarantee that a transaction system will run serializably
without any extra effort having to be taken. Systems satisfying
these conditions are said to achieve serializability for free.
The conditions considered by Graham are determined by a
syntactic check on the transaction programs, and are independent
of the semantics of data. In this paper, we take a semantic
approach and propose a sufficient condition for achieving data
synchronization for free which is based on the concept of data
similarity [5]. Real-time transactions satisfying this condition
can be scheduled correctly by any process scheduling discipline
that is designed for the independent processes model [8] (e.g.,
RMS, EDF) where no locking of data is assumed. With our
approach, the usually high utilization factor that can be
achieved by these scheduling disciplines is also attainable for
transactions satisfying our condition.
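
To make the last point concrete: once the conditions hold,
transactions can be treated as independent periodic tasks, so the
well-known utilization-based tests apply. The following Python
sketch is illustrative only. The execution times and function names
are assumptions that do not appear in this paper, and the tests shown
are the classic utilization bounds for independent periodic tasks
whose deadlines equal their periods.

```python
def utilization(tasks):
    """tasks: list of (execution_time, period) pairs (hypothetical)."""
    return sum(c / p for c, p in tasks)


def edf_schedulable(tasks):
    # EDF schedules independent periodic tasks with deadline equal to
    # period exactly when the total utilization does not exceed 1.
    return utilization(tasks) <= 1.0


def rms_bound_ok(tasks):
    # Sufficient (not necessary) rate-monotonic test:
    # utilization <= n * (2^(1/n) - 1).
    n = len(tasks)
    return n == 0 or utilization(tasks) <= n * (2 ** (1 / n) - 1)


tasks = [(2, 4), (3, 8)]  # utilization 0.875
print(edf_schedulable(tasks))
print(rms_bound_ok(tasks))
```

A task set that fails `rms_bound_ok` is not necessarily
unschedulable under rate-monotonic priorities; the bound is only a
sufficient test.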

We believe that there are many interesting research issues
concerning the concept of similarity. To gain experience, it is
important to investigate how to construct similarity relations
systematically from application specifications. A toolset which
facilitates reasoning about similarity relations for typical
real-time applications should be very useful.